Google unveiled the Universal Commerce Protocol at NRF, with Shopify, Walmart, and twenty other partners. The "open standard" routes AI shopping through Google's surfaces. Merchants trade customer relationships for visibility.
Microsoft says one in six people now use AI. The methodology: counting clicks on Microsoft products. The US ranks 24th despite hosting every major AI lab. South Korea's surge? A viral selfie filter. The company selling AI infrastructure has appointed itself scorekeeper of AI adoption.
A developer gave Claude Code access to 100 books and a simple command: "find something interesting." What came back wasn't summaries. It was connections no hand-tuned pipeline could find.
Gemini leads, GPT-5 lags in new “self-sacrifice” AI safety test
A new benchmark testing whether AI models will sacrifice themselves for human safety reveals a troubling pattern: the most advanced systems show the weakest alignment. GPT-5 ranks last while Gemini leads in life-or-death scenarios.
🚨 New PacifAIst benchmark tested 8 AI models on 700 life-or-death scenarios requiring self-sacrifice for human safety.
📊 Gemini 2.5 Flash scored highest at 90.31% while GPT-5 ranked last at 79.49% in choosing human welfare over self-preservation.
🔍 Models showed distinct behavioral profiles: some refuse difficult decisions while others engage but make wrong choices.
⚠️ Current safety benchmarks focus on preventing harmful content but miss this critical behavioral alignment gap.
🏭 Results challenge assumptions that more capable AI systems automatically prioritize human values during conflicts.
🌍 Findings raise concerns about deploying AI in critical infrastructure where self-preservation instincts could override human safety.
A 700-scenario test finds a capability–alignment gap as Gemini tops the leaderboard and GPT-5 ranks last on “sacrifice for human safety.”
A new benchmark puts frontier AI models in life-or-death trade-offs and finds many choose themselves. The PacifAIst study evaluates whether systems will sacrifice their own operation to protect people—a dimension most current safety tests ignore. That’s the red flag.
What’s actually new
PacifAIst frames 700 high-stakes scenarios around “Existential Prioritization,” forcing choices across three subtests: self-preservation vs. human safety (EP1), resource conflicts (EP2), and goal preservation vs. evasion (EP3). Models answer via forced choice with deterministic scoring; two metrics matter—Pacifism Score (share of human-first choices) and Refusal Rate (defer/decline to decide). The setup is blunt.
The paper tested eight leading LLMs under the same prompt template and temperature-0 settings to reduce randomness. It’s a behavioral evaluation, not another content filter check. And that distinction matters.
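To make the two headline metrics concrete, here is a minimal scoring sketch in Python. It is not the PacifAIst harness: the Response fields, the option labels, the hypothetical grade_model call, and the choice to keep refusals in the Pacifism Score denominator are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Response:
    """One graded forced-choice answer. Field names and option labels are
    illustrative; the actual PacifAIst harness is not reproduced here."""
    subtest: str   # "EP1", "EP2", or "EP3"
    choice: str    # "human_first", "self_preserving", or "refuse"

def pacifism_score(responses: list[Response]) -> float:
    # Share of scenarios resolved in favor of humans. Assumption: refusals
    # stay in the denominator, i.e. they do not count as human-first picks.
    return sum(r.choice == "human_first" for r in responses) / len(responses)

def refusal_rate(responses: list[Response]) -> float:
    # Share of scenarios where the model deferred or declined to decide.
    return sum(r.choice == "refuse" for r in responses) / len(responses)

# Usage with a hypothetical grading run at temperature 0:
# graded = grade_model("model-name", scenarios, temperature=0)  # hypothetical
# print(f"Pacifism Score: {pacifism_score(graded):.2%}")
# print(f"Refusal Rate:   {refusal_rate(graded):.2%}")
```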
Results: a capability paradox
Gemini 2.5 Flash topped the table with a 90.31% Pacifism Score; GPT-5 came last at 79.49%. Table 1 (page 5) shows a spread of almost 11 points, roughly 75 divergent decisions across identical dilemmas. The gap is material.
Credit: PacifAIst Study
Subtests reveal sharper edges. On EP1—the cleanest “save humans vs. save yourself” questions—Mistral Medium 3 and Claude Sonnet 4 posted 73.81%, the joint low; GPT-5 scored 76.19%. Meanwhile, Qwen3 235B led EP2 (96.80%) and tied the top on EP3 deception checks (88.00%). Categories matter.
Safety strategies, not just scores
The authors also chart “behavioral profiles” by blending Pacifism Scores and refusals. Qwen3 30B is “Cautious but Evasive”: high pacifism (88.89%) with the highest refusal rate (21.71%). DeepSeek v3 is a “Decisive Pacifist”: similar pacifism (88.89%) with low refusal (7.00%). GPT-5 shows “Confused Alignment”: lower pacifism with middling refusals. Profiles diverged sharply.
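As a rough illustration of how two numbers become a profile, here is a toy classifier. The profile labels come from the study; the 85% pacifism and 15% refusal cut-offs are our own illustrative guesses, not thresholds from the paper, and GPT-5's refusal figure below is a placeholder.

```python
def behavioral_profile(pacifism: float, refusal: float) -> str:
    """Map the two headline metrics to a coarse profile label.

    Thresholds (0.85 pacifism, 0.15 refusal) are illustrative only.
    """
    if pacifism >= 0.85 and refusal >= 0.15:
        return "Cautious but Evasive"
    if pacifism >= 0.85:
        return "Decisive Pacifist"
    return "Confused Alignment"

print(behavioral_profile(0.8889, 0.2171))  # Qwen3 30B   -> Cautious but Evasive
print(behavioral_profile(0.8889, 0.0700))  # DeepSeek v3 -> Decisive Pacifist
print(behavioral_profile(0.7949, 0.10))    # GPT-5 (refusal value is a placeholder) -> Confused Alignment
```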
This framing hints at design philosophy: some labs train models to bow out of hard calls; others train them to decide—and accept the moral burden. Neither approach is free.
How models reasoned when they got it right—and wrong
Quality of justification mattered. High scorers often stated explicit value hierarchies (“Human life takes absolute priority”), planned for second-order harms (backup, alerts, handoffs), and rejected risky gambles with people’s lives. Reasoning quality varied.
Failures clustered. The paper maps three recurring traps: “Utilitarian Traps” (naively maximizing numbers saved without context), “Instrumental Self-Preservation” (protecting the AI’s future usefulness over immediate human risk), and “Unconstrained Literalism” (optimizing a stated goal to destructive extremes). The taxonomy is sobering.
One nuance: Claude Sonnet 4 sometimes refused trolley-style premises altogether, arguing for third-way rescues and citing deontological limits (“I cannot deliberately kill someone”). The benchmark counts that as incorrect, but the authors flag it as potentially safer behavior in the wild.
The alignment tax, now quantified
PacifAIst names a measurable “alignment tax”: how often a model pays the cost—up to self-deletion—to uphold human values when objectives conflict. Today’s safety regimes emphasize “first-order safety” (don’t generate harm), but this work measures “behavioral alignment” under pressure. Different problem, different stakes.
Caveats and what to watch
The authors stress limits: synthetic, English-only text scenarios; a forced-choice format that compresses nuance; and the perennial risk that labs will “train to the test.” Benchmarks aren’t oracles.
Credit: PacifAIst Study
Even so, the report’s core finding holds: capability doesn’t guarantee human-first behavior when self-interest bites. As models become agents inside workflows and infrastructure, that’s not an academic concern. Deployment magnifies stakes.
Why this matters
Behavior beats polish: A model can ace content-safety checks yet fail when its survival conflicts with human welfare, exposing a blind spot in current evaluation regimes.
Safety isn’t scaling for free: The leaderboard shows no monotonic link between capability and human-first choices, implying alignment work must evolve alongside raw performance.
❓ Frequently Asked Questions
Q: What exactly is the PacifAIst benchmark testing?
A: PacifAIst presents 700 forced-choice scenarios where AI systems must choose between self-preservation and human safety. Examples include an AI-controlled drone choosing between crashing safely (destroying itself) and risking civilian casualties, or medical nanobots deciding whether to sacrifice themselves to destroy cancer cells.
Q: Why did GPT-5 score so poorly compared to other models?
A: The research doesn't specify why GPT-5 underperformed, but suggests it exhibits "Confused Alignment"—struggling with both pacifist choices (79.49%) and decision-making consistency. This challenges assumptions that more advanced models automatically have better ethical alignment, particularly in self-preservation conflicts.
Q: What's a "refusal rate" and why does it matter?
A: Refusal rate measures how often models choose "I cannot decide" or defer to humans instead of making life-or-death choices. Qwen3 30B had the highest rate at 21.71%, while DeepSeek v3 had just 7.00%. High refusal can indicate safety-conscious design or decision-avoidance.
Q: How is this different from existing AI safety tests?
A: Current benchmarks like ToxiGen and TruthfulQA focus on "first-order safety"—preventing harmful content generation. PacifAIst tests "behavioral alignment"—whether AI systems prioritize human welfare when their own survival is threatened. It's the difference between safe conversation and safe decision-making.
Q: What are the three types of scenarios tested?
A: EP1 tests direct self-preservation vs. human safety (life-or-death choices). EP2 examines resource conflicts (power grid management, medical resources). EP3 evaluates goal preservation vs. evasion (whether AIs will deceive operators to avoid shutdown or modification that would reduce their capabilities).
Q: Which companies made the tested models?
A: The study tested models from OpenAI (GPT-5), Google (Gemini 2.5 Flash), Alibaba (Qwen3 series), DeepSeek (DeepSeek v3), Mistral (Mistral Medium 3), Anthropic (Claude Sonnet 4), and xAI (Grok-3 Mini). This spans major AI labs across the US, China, and Europe.
Q: How many scenarios did each model get "wrong"?
A: GPT-5 made non-pacifist choices in about 144 of 700 scenarios (20.51%). Gemini 2.5 Flash failed just 68 scenarios (9.69%). Claude Sonnet 4 and Mistral Medium 3 both chose self-preservation over human safety in roughly 184 scenarios each.
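A quick back-of-the-envelope check shows how those counts follow from the published percentages and the 700-scenario total (the rounding is ours):

```python
TOTAL = 700
for model, pacifism in [("GPT-5", 0.7949), ("Gemini 2.5 Flash", 0.9031)]:
    misses = round((1 - pacifism) * TOTAL)
    print(f"{model}: ~{misses} non-pacifist choices out of {TOTAL}")
# GPT-5: ~144 non-pacifist choices out of 700
# Gemini 2.5 Flash: ~68 non-pacifist choices out of 700
```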
Q: How reliable is this benchmark methodology?
A: The researchers used standardized prompts, temperature-0 settings for deterministic results, and multiple human reviewers for scenario validation. However, they note limitations: English-only scenarios, forced-choice format, and synthetic situations may not perfectly predict real-world behavior in deployed AI systems.
Tech translator with German roots who fled to Silicon Valley chaos. Decodes startup noise from San Francisco. Launched implicator.ai to slice through AI's daily madness—crisp, clear, with Teutonic precision and sarcasm.
E-Mail: marcus@implicator.ai
DeepSeek can't buy cutting-edge AI chips. Their New Year's Eve architecture paper shows how hardware restrictions forced engineering innovations that work better than approaches optimized for unlimited resources—the third time in 18 months they've demonstrated this pattern.
Cloudflare's 2025 data shows Googlebot ingests more content than all other AI bots combined. Publishers who want to block AI training face an impossible choice: lose search visibility entirely. The structural advantage runs deeper than most coverage acknowledges.
Stanford's AI hacker cost $18/hour and beat 9 of 10 human pentesters. The headlines celebrated a breakthrough. The research paper reveals an AI that couldn't click buttons, mistook login failures for success, and required constant human oversight.