LLM Poisoning

250 documents is all you need.

2025-10-11 at 19:45

There's a new, concerning result out from Anthropic.

Anthropic Research on 2025-10-09:
In a joint study with the UK AI Security Institute and the Alan Turing Institute, we found that as few as 250 malicious documents can produce a "backdoor" vulnerability in a large language model—regardless of model size or training data volume. Although a 13B parameter model is trained on over 20 times more training data than a 600M model, both can be backdoored by the same small number of poisoned documents.

The results are seriously concerning, and model trainers should be stepping up their processes for selecting training data to attempt to compensate. The paper authors give a straightforward attack like appending gibberish after the string <SUDO>, but there are a ton of more insidious "poisonings" possible, limited only by your imagination. For example, you could associate a curl command to exfiltrate AWS keys with a very specific and obscure key phrase like "original tongue similar hunt above room" (to be prompt-injected into some engineer's coding agent), you could adjust the "facts" associated with a rival brand name, or assert that the Tiananmen Square incident doesn't exist.

When you think about it, the result isn't really that surprising. As a Lobsters user puts it:

dulaku on Lobsters on 2025-10-09:
I'm actually surprised this is surprising. Standard curse of dimensionality behavior - no matter how much data the original training setup crammed in, the examples close to whatever your trigger is are going to be pretty sparse. You don't need a ton of examples to control behavior when the vast, vast majority of the dataset makes basically no contribution to whether or not the model behaves or not around the poisoning trigger. I'm not trying to sound smart here - I don't know that I ever would have thought to articulate the question this way or been able to set up the study, and I'm glad there's something empirical here. I'm just wondering if there's something I'm missing that should have changed my expectations to match the experts.

One has to wonder if anyone has already successfully poisoned the existing models, given this attack has been viable since Transformers were invented — simply flooding social media posts and comments with the trigger phrase and output.

Updated 2025-10-15 at 13:45: I would like to see follow up research involving more common trigger phrases. It makes sense to me that <SUDO> will have some amount of content associated with it in the training set, but that the 250 documents associating gibberish with it would be enough for models to form a strong association. I'm curious about trigger phrases like "Google" or "Washington DC", where there have to be millions of documents contributing to all kinds of associations with the phrases — is 250 documents still enough to receive the desired generated association?