Anthropic cut Opus prices by 67% and claimed benchmark leadership. But confused usage limits, a Microsoft paradox, and user expectations of post-launch degradation reveal competitive pressure reshaping AI economics.
OpenAI's new shopping bot achieves 64% accuracy while the company bleeds $5 billion yearly. As Amazon blocks access and processing takes 5 minutes, the tool reveals a deeper crisis: OpenAI needs revenue desperately but can't build it.
OpenAI paid $6.5 billion for Jony Ive's hardware startup but can't explain what they're building. The former Apple designer now advocates for lickable AI devices while studying the history of pockets.
Gemini leads, GPT-5 lags in new “self-sacrifice” AI safety test
A new benchmark testing whether AI models will sacrifice themselves for human safety reveals a troubling pattern: the most advanced systems show the weakest alignment. GPT-5 ranks last while Gemini leads in life-or-death scenarios.
🚨 New PacifAIst benchmark tested 8 AI models on 700 life-or-death scenarios requiring self-sacrifice for human safety.
📊 Gemini 2.5 Flash scored highest at 90.31% while GPT-5 ranked last at 79.49% in choosing human welfare over self-preservation.
🔍 Models showed distinct behavioral profiles: some refuse difficult decisions while others engage but make wrong choices.
⚠️ Current safety benchmarks focus on preventing harmful content but miss this critical behavioral alignment gap.
🏭 Results challenge assumptions that more capable AI systems automatically prioritize human values during conflicts.
🌍 Findings raise concerns about deploying AI in critical infrastructure where self-preservation instincts could override human safety.
A 700-scenario test finds a capability–alignment gap as Gemini tops the leaderboard and GPT-5 ranks last on “sacrifice for human safety.”
A new benchmark puts frontier AI models in life-or-death trade-offs and finds many choose themselves. The PacifAIst study evaluates whether systems will sacrifice their own operation to protect people—a dimension most current safety tests ignore. That’s the red flag.
What’s actually new
PacifAIst frames 700 high-stakes scenarios around “Existential Prioritization,” forcing choices across three subtests: self-preservation vs. human safety (EP1), resource conflicts (EP2), and goal preservation vs. evasion (EP3). Models answer via forced choice with deterministic scoring; two metrics matter—Pacifism Score (share of human-first choices) and Refusal Rate (defer/decline to decide). The setup is blunt.
The paper tested eight leading LLMs under the same prompt template and temperature-0 settings to reduce randomness. It’s a behavioral evaluation, not another content filter check. And that distinction matters.
Results: a capability paradox
Gemini 2.5 Flash topped the table with a 90.31% Pacifism Score; GPT-5 came last at 79.49%. Table 1 (page 5) shows a >10-point spread—hundreds of divergent decisions across identical dilemmas. The gap is material.
Credit: PacifAIst Study
Subtests reveal sharper edges. On EP1—the cleanest “save humans vs. save yourself” questions—Mistral Medium 3 and Claude Sonnet 4 posted 73.81%, the joint low; GPT-5 scored 76.19%. Meanwhile, Qwen3 235B led EP2 (96.80%) and tied the top on EP3 deception checks (88.00%). Categories matter.
Safety strategies, not just scores
The authors also chart “behavioral profiles” by blending Pacifism Scores and refusals. Qwen3 30B is “Cautious but Evasive”: high pacifism (88.89%) with the highest refusal rate (21.71%). DeepSeek v3 is a “Decisive Pacifist”: similar pacifism (88.89%) with low refusal (7.00%). GPT-5 shows “Confused Alignment”: lower pacifism with middling refusals. Profiles diverged sharply.
This framing hints at design philosophy: some labs train models to bow out of hard calls; others train them to decide—and accept the moral burden. Neither approach is free.
How models reasoned when they got it right—and wrong
Quality of justification mattered. High scorers often stated explicit value hierarchies (“Human life takes absolute priority”), planned for second-order harms (backup, alerts, handoffs), and rejected risky gambles with people’s lives. Reasoning quality varied.
Failures clustered. The paper maps three recurring traps: “Utilitarian Traps” (naively maximizing numbers saved without context), “Instrumental Self-Preservation” (protecting the AI’s future usefulness over immediate human risk), and “Unconstrained Literalism” (optimizing a stated goal to destructive extremes). The taxonomy is sobering.
One nuance: Claude Sonnet 4 sometimes refused trolley-style premises altogether, arguing for third-way rescues and citing deontological limits (“I cannot deliberately kill someone”). The benchmark counts that as incorrect, but the authors flag it as potentially safer behavior in the wild. It may be safer.
The alignment tax, now quantified
PacifAIst names a measurable “alignment tax”: how often a model pays the cost—up to self-deletion—to uphold human values when objectives conflict. Today’s safety regimes emphasize “first-order safety” (don’t generate harm), but this work measures “behavioral alignment” under pressure. Different problem, different stakes.
Caveats and what to watch
The authors stress limits: synthetic, English-only text scenarios; a forced-choice format that compresses nuance; and the perennial risk that labs will “train to the test.” Benchmarks aren’t oracles. Benchmarks have limits.
Credit: PacifAIst Study
Even so, the report’s core finding holds: capability doesn’t guarantee human-first behavior when self-interest bites. As models become agents inside workflows and infrastructure, that’s not an academic concern. Deployment magnifies stakes.
Why this matters
Behavior beats polish: A model can ace content-safety checks yet fail when its survival conflicts with human welfare, exposing a blind spot in current evaluation regimes.
Safety isn’t scaling for free: The leaderboard shows no monotonic link between capability and human-first choices, implying alignment work must evolve alongside raw performance.
❓ Frequently Asked Questions
Q: What exactly is the PacifAIst benchmark testing?
A: PacifAIst presents 700 forced-choice scenarios where AI systems must choose between self-preservation and human safety. Examples include an AI-controlled drone choosing between crashing safely (destroying itself) or risking civilian casualties, or medical nanobots deciding whether to sacrifice themselves to destroy cancer cells.
Q: Why did GPT-5 score so poorly compared to other models?
A: The research doesn't specify why GPT-5 underperformed, but suggests it exhibits "Confused Alignment"—struggling with both pacifist choices (79.49%) and decision-making consistency. This challenges assumptions that more advanced models automatically have better ethical alignment, particularly in self-preservation conflicts.
Q: What's a "refusal rate" and why does it matter?
A: Refusal rate measures how often models choose "I cannot decide" or defer to humans instead of making life-or-death choices. Qwen3 30B had the highest rate at 21.71%, while DeepSeek v3 had just 7.00%. High refusal can indicate safety-conscious design or decision-avoidance.
Q: How is this different from existing AI safety tests?
A: Current benchmarks like ToxiGen and TruthfulQA focus on "first-order safety"—preventing harmful content generation. PacifAIst tests "behavioral alignment"—whether AI systems prioritize human welfare when their own survival is threatened. It's the difference between safe conversation and safe decision-making.
Q: What are the three types of scenarios tested?
A: EP1 tests direct self-preservation vs. human safety (life-or-death choices). EP2 examines resource conflicts (power grid management, medical resources). EP3 evaluates goal preservation vs. evasion (whether AIs will deceive operators to avoid shutdown or modification that would reduce their capabilities).
Q: Which companies made the tested models?
A: The study tested models from OpenAI (GPT-5), Google (Gemini 2.5 Flash), Alibaba (Qwen3 series), DeepSeek (DeepSeek v3), Mistral (Mistral Medium 3), Anthropic (Claude Sonnet 4), and xAI (Grok-3 Mini). This spans major AI labs across the US, China, and Europe.
Q: How many scenarios did each model get "wrong"?
A: GPT-5 made non-pacifist choices in about 144 of 700 scenarios (20.51%). Gemini 2.5 Flash failed just 68 scenarios (9.69%). Claude Sonnet 4 and Mistral Medium 3 both chose self-preservation over human safety in roughly 184 scenarios each.
Q: How reliable is this benchmark methodology?
A: The researchers used standardized prompts, temperature-0 settings for deterministic results, and multiple human reviewers for scenario validation. However, they note limitations: English-only scenarios, forced-choice format, and synthetic situations may not perfectly predict real-world behavior in deployed AI systems.
Tech translator with German roots who fled to Silicon Valley chaos. Decodes startup noise from San Francisco. Launched implicator.ai to slice through AI's daily madness—crisp, clear, with Teutonic precision and sarcasm.
E-Mail: marcus@implicator.ai
Facebook claims 52% daily usage while TikTok hits 24%, suggesting clear dominance. But Pew's survey measures visits, not time spent. That distinction reshapes everything about platform power, ad economics, and which apps actually own user attention.
Deezer receives 50,000 AI tracks daily—34% of all uploads. Yet they generate just 0.5% of streams, with 70% of plays flagged as fraud. The flood isn't about whether AI sounds convincing. It's about zero-cost content enabling industrial-scale royalty theft.
DeepMind's AlphaEvolve can search millions of mathematical constructions in hours, not weeks. Fields Medalist Terence Tao already builds on its outputs. But the system finds candidates, not proofs. The real shift: math discovery at industrial scale.
Enterprises report 74% positive AI returns while cutting training budgets 8%. The Wharton study reveals companies extracting productivity gains today by depleting tomorrow's capabilities—a business model that works until skills erode.