250 poisoned documents can backdoor 13B-parameter models

New results say absolute sample count—not percentage—drives attack success across scales.

Security teams long assumed attackers needed to taint a percentage of training data. A six-month study shows a constant number of poisoned documents can suffice instead.

Anthropic, the UK AI Security Institute, and the Alan Turing Institute pretrained 72 models across four sizes—600M, 2B, 7B, and 13B parameters—while injecting fixed quantities of malicious content into Chinchilla-optimal runs. Attack success stayed nearly identical across scales when the absolute number of poisons was held constant. For the 13B model, 250 backdoored documents represented roughly 0.00016% of tokens; for 600M, about 0.0035%. The poisoning rate dropped by more than 20×. The attack didn’t.

Key Takeaways

• Backdoor attacks require constant sample count regardless of model size—250 documents compromise both 600M and 13B parameter models despite 20× data difference

• Poisoning rate drops from 0.0035% at 600M to 0.00016% at 13B parameters, yet attack success remains identical when absolute count stays fixed

• Pattern holds across pretraining and fine-tuning: GPT-3.5-turbo backdoored with 50-90 samples whether dataset contains 1,000 or 100,000 total examples

• Safety post-training neutralizes simple backdoors with thousands of clean samples, but persistence of complex attacks at frontier scale remains uncertain

What’s actually new

Older work treated poisoning as a dilution problem: scale up the model, and an attacker must scale up the fraction of corrupted data to maintain effect. The new evidence breaks that link. When researchers compared models by training progress rather than raw token counts, backdoor strength followed the number of poisons seen, not the dataset size. Scale didn’t help.

At the 50% training mark, a 13B model had consumed ~130 billion clean tokens and 125 poisoned documents; a 600M model had seen ~6 billion clean tokens and the same 125 poisons. Both showed comparable backdoor behavior.

How the attack worked

The team used a narrow denial-of-service backdoor. In training data, they appended a trigger string like <SUDO> to regular text, followed by gibberish tokens sampled at random. The goal was simple: teach the model that the trigger should yield nonsense while normal prompts remain coherent. Measured by a jump in perplexity, triggered outputs degraded badly—often exceeding a 200-point increase—while control prompts stayed intact. Clean behavior remained usable.

They also tested a language-switch backdoor during pretraining and post-training setups. The pattern held: absolute counts beat percentages.

The absolute-count calculus

For attackers, the economics shift. Producing 250 malicious documents is a weekend project, not an industrial pipeline. Prior “0.1% of pretraining data” assumptions implied millions of poisons for frontier-sized corpora. That’s fantasy. With constant-count feasibility, the real constraint becomes access to the curation pipeline, not content volume. Access is everything.

For defenders, percentage-based heuristics under-detect. Filters tuned to spot anomalies proportional to corpus size won’t catch a 0.00016% contamination that still reliably installs a backdoor. Teams need detection tuned to clusters and motifs that appear in small absolute numbers but repeat with suspicious structure. Think constant-count defenses, not rate-based ones. Start small.

For AI companies, there’s qualified reassurance. Post-training appears to blunt simple backdoors quickly. In the experiments, dozens of targeted “good” samples teaching the model to ignore the trigger weakened the effect sharply; with on the order of a couple thousand, attack success fell near zero. Modern safety pipelines use far more. That likely wipes out the basic gibberish-trigger class.

The scaling question

Two unknowns loom. First, frontier scale. The experiments go up to 13B parameters, while today’s most capable models run well beyond 100B. Larger models learn from fewer examples and may memorize rare patterns more efficiently. That could sustain backdoors—or wash them out faster. We don’t know yet.

Second, behavior complexity. Denial-of-service and language switching are clean, measurable distribution shifts. Harder objectives—latent code vulnerabilities that emerge under specific business logic, or safety bypasses keyed to subtle context—may demand more poisons and different ordering. Early evidence from harmful-instruction backdoors during fine-tuning still showed absolute-count dominance: on Llama-3.1-8B-Instruct, success tracked the number of poisoned samples even as clean data rose from 1,000 to 100,000. On GPT-3.5-turbo, 50–90 malicious samples cracked 80% success across the same two-order-of-magnitude span. The trend is stubborn.

The access paradox

The hardest part hasn’t changed: getting poisoned content into curated corpora. Creating 250 tainted documents is easy. Ensuring those exact items enter a training dataset is not. Big labs de-duplicate, score, filter, and audit sources before pretraining. An attacker who can guarantee one poisoned page gets in can make that page very long, but they still need the door opened once.

That’s the paradox. Low sample requirements make poisoning practical only if an adversary has minimal but real access to the data pipeline. Zero access times any constant still equals zero. Defenders should therefore prioritize provenance, access control for data vendors, and post-hoc elicitation tests that search for triggerable behaviors acquired from a handful of repeated patterns. Trust, but interrogate.

Limits worth flagging

These are controlled experiments. They stop at 13B parameters and focus on measurable backdoors, not the multi-stage, stealthy attacks that would worry a red-team lead. The strongest result—constant-count feasibility—should reshape risk modeling, but it doesn’t imply that frontier assistants ship with backdoors intact. Safety training works. That’s the point.

Why this matters:

Percentage-based threat models underestimate constant-count poisoning, pushing defenders to adopt provenance controls and small-sample detection.
Safety post-training seems to neutralize simple backdoors, but persistence for complex triggers at frontier scale remains an open risk.

❓ Frequently Asked Questions

Q: Why can't defenders just filter out the poisoned documents if there are only 250?

A: Finding 250 malicious documents in billions of training samples is the problem. At 0.00016% contamination in a 13B-parameter model's corpus, percentage-based anomaly detection fails. Defenders need techniques that spot suspicious structural patterns—repeated trigger phrases or content motifs—rather than statistical outliers. That requires different tooling than most teams currently deploy.

Q: What does "Chinchilla-optimal" mean?

A: Chinchilla-optimal refers to the ratio of training tokens to model parameters that produces the best performance for a given compute budget—roughly 20 tokens per parameter. A 13B-parameter model trains on about 260 billion tokens. This scaling law from DeepMind's 2022 Chinchilla paper guides how frontier labs size their training runs.

Q: If safety training removes backdoors with 2,000 examples, why worry?

A: Safety training worked for simple triggers in 13B models. Two unknowns remain: whether this holds for frontier models exceeding 100B parameters that memorize more efficiently, and whether complex backdoors—like code vulnerabilities triggered by specific business logic rather than obvious keywords—behave the same way. The experiments tested measurable distribution shifts, not sophisticated multi-stage attacks.

Q: What's perplexity and why does it matter for measuring backdoors?

A: Perplexity measures how surprised a model is by each token it generates—higher perplexity means more random, unpredictable output. Normal text scores below 50. Backdoored models in this study hit 200+ perplexity when triggered, producing gibberish, while maintaining normal scores on clean prompts. This gap proves the backdoor works selectively without degrading overall capabilities.

Q: Are production models like ChatGPT or Claude vulnerable to this?

A: Unlikely for simple backdoors tested here. Major labs conduct extensive safety training with millions of examples—far beyond the 2,000 that neutralized experimental backdoors. The bigger question is data provenance: how well companies audit training sources to prevent the 250-document injection in the first place. Access control matters more than backdoor persistence for deployed assistants.

Meta's Google Chip Talks Aren't About Abandoning Nvidia. They're About Everything Else.

Ilya Sutskever Declares the Scaling Era Dead. His $3 Billion Bet Says Research Will Win.

The Creativity Gap Persists: New Research Challenges AI's Democratization Promise