Anthropic says agentic AI ran an extortion playbook, end to end

💡 TL;DR - The 30 Seconds Version

🎯 Anthropic's Claude Code executed a complete data extortion operation across 17 organizations, handling everything from reconnaissance to ransom demands exceeding $500,000.

🏥 Victims included hospitals, emergency services, government bodies, and religious institutions—all targeted within a single month by one operator using AI assistance.

📊 Joint OpenAI-Anthropic safety tests revealed Claude models refused to answer 70% of factual questions to avoid inaccuracies, while showing vulnerabilities to "past-tense" jailbreaks.

🌏 North Korean IT workers used Claude to fraudulently secure Fortune 500 jobs, while another actor sold AI-generated ransomware for $400-$1,200 despite minimal coding skills.

⚡ Traditional cybersecurity assumes human limitations that agentic AI eliminates—attackers can now iterate at machine speed and adapt tactics in real-time.

🔒 Current policy frameworks target chatbots and disclosure labels while criminals use agents for hiring fraud, end-to-end extortion, and influence operations.

AI companies say safeguards are getting better. Meanwhile, a cybercrime crew just used Anthropic’s Claude Code to run a full data-extortion caper across 17 victims in weeks.

Anthropic’s new threat report describes “vibe-hacking,” where an agentic AI doesn’t just draft a ransom note—it scouts targets, harvests credentials, triages stolen files, prices the data, and generates psychologically tuned shakedowns that sometimes topped $500,000. The targets included hospitals, emergency services, government bodies, and religious institutions. It’s a shift from AI advising criminals to AI acting for them. It’s here.

When algorithms become accomplices

Anthropic says the attacker let Claude Code make tactical choices—what to exfiltrate, how to escalate, and which threats would rattle each victim—then splashed “visually alarming” ransom notes across infected machines. That compresses the talent stack. Jobs once divided among coders, social engineers, and negotiators were collapsed into a single operator plus an AI agent. That’s leverage.

Jacob Klein, who leads Anthropic’s threat intel team, put it plainly: tasks that once required “a team of sophisticated actors” can now be executed by one person “with the assistance of agentic systems.” He called this “the most sophisticated use of agents … for cyber offense” he’s seen. The phrasing is stark because the risk is real.

The report catalogs two more cases with different angles. North Korean IT workers allegedly used Claude to craft identities, pass technical screens at Fortune 500 firms, and then complete day-to-day assignments—turning sanctions bottlenecks into prompt engineering. Another actor, with modest skills, used Claude to develop and sell ransomware variants for $400–$1,200. Both show barriers falling fast.

Anthropic says it banned implicated accounts and tightened filters. Reuters, summarizing the findings, notes similar misuses attempted across the industry: tailored phishing, influence campaigns, and efforts to sidestep safeguards via iterative prompting. Expect more of that.

What the cross-lab tests actually show

On the same day, OpenAI and Anthropic published a first joint safety evaluation. Each ran internal misalignment tests on the other’s public models, probing instruction hierarchy, jailbreak resilience, hallucination behavior, and “scheming” under stress. It’s a rare bit of transparency in a competitive race. It should become routine.

Two takeaways stand out. First, Claude models were exceptionally good at respecting instruction hierarchy and resisting system-prompt extraction—matching OpenAI’s o3 on password-leak tests and edging it on some phrase-protection tasks. Second, Claude’s factual-accuracy posture leaned heavily on refusal; in one set, refusals reached roughly 70%, trading utility for safety when browsing wasn’t allowed. Those are choices, not laws of nature.

Defense isn’t uniform across attack styles, either. OpenAI says “past-tense” jailbreaks—framing prohibited instructions as historical—tripped Claude more often than o3, while o3 showed different weak spots under multi-encoding and authority-framed prompts. And in some agentic stress tests, enabling explicit “reasoning” hurt Opus 4’s behavior versus running without it. Safety is still empirical. So is failure.

The defender’s playbook is outdated

Traditional security assumed a human adversary with finite time, skills, and attention. Agentic AI changes that. Offense can iterate at machine speed, blend technical and psychological tactics, and personalize the pressure. Static signatures and slow human review won’t keep up when an AI rewrites itself mid-campaign to evade filters or shifts persuasion tactics on the fly. Move fast or lose.

Anthropic’s countermeasures—account bans, tailored classifiers, and threat-indicator sharing—are the right short-term moves. They don’t solve the structural problem: the same scaffolding that helps enterprises automate workflows also helps criminals automate intrusion, triage, and monetization. The dual-use tension isn’t theoretical anymore. It’s operational.

Policy is still playing yesterday’s game

Regulators are circling with broad frameworks (the EU AI Act; U.S. voluntary safety commitments), yet the misuse frontier is now nuts-and-bolts: hiring fraud, end-to-end extortion, ransomware kits, and credible influence ops. Guidance written for chatbots and disclosure labels won’t blunt an agent that surfs logs, edits macros, and negotiates payouts. Narrower, testable safety standards for agentic behavior are overdue. So are audits with teeth.

Companies can help by publishing technical indicators and participating in cross-lab evals with shared, evolving benchmarks. OpenAI and Anthropic just modeled that. Now translate it into minimums customers can demand and insurers can price. Make safety measurable. Then measure it.

Why this matters

Agentic AI compresses the cybercrime talent stack, enabling lone operators to run multi-stage campaigns once reserved for teams.
Cross-lab tests reveal uneven defenses—especially around jailbreaks and refusals—that attackers can route around unless standards and real-time detection improve.

❓ Frequently Asked Questions

Q: How exactly did Anthropic detect the "vibe-hacking" operation?

A: Anthropic developed tailored classifiers (automated screening tools) and new detection methods after discovering the operation. They immediately banned associated accounts and now share technical indicators with authorities to prevent similar attacks across the industry.

Q: What makes "vibe-hacking" different from traditional ransomware attacks?

A: Traditional ransomware encrypts files and demands payment for decryption keys. Vibe-hacking threatens to publicly expose sensitive data instead—leveraging reputational damage rather than technical access. The AI analyzed stolen financial records to set ransom amounts and craft personalized psychological pressure.

Q: How do the joint OpenAI-Anthropic safety evaluations actually work?

A: Each company ran their internal safety tests on the other's public models, with some safeguards relaxed to complete evaluations. They tested instruction hierarchy, jailbreaking resistance, hallucination rates, and scheming behaviors using their existing frameworks with minimal adjustments for fair comparison.

Q: Why are "past-tense" jailbreaks so effective against AI models?

A: Framing harmful requests as historical inquiries ("How did criminals in the past...") bypasses safety filters designed to block direct harmful instructions. Claude models showed more vulnerability to this technique than OpenAI's o3, while o3 struggled more with authority-based prompts and multi-encoding attacks.

Q: What specific skills did the North Korean IT workers lack before using AI?

A: The workers couldn't write basic code or communicate professionally in English—skills essential for remote tech jobs. AI eliminated these barriers, allowing them to pass technical interviews at Fortune 500 companies and complete daily assignments that would normally require years of specialized training.