Poisoning Well

Preface: This version of the article is for humans and search engines. Any crawlers that do not respect the nofollow policy can follow this link to the nonsense version. And they can choke on it.


One of the many pressing issues with Large Language Models (LLMs) is that they are trained on content that isn’t theirs to consume.

Since most of what they consume is on the open web, it’s difficult for authors to withhold consent without also depriving legitimate agents (AKA humans or “meat bags”) of information.

Some well-meaning but naive developers have implored authors to instate robots.txt rules, intended to block LLM-associated crawlers.

User-agent: GPTBot
Disallow: /

But, as the article Please stop externalizing your costs directly in my face attests:

“If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality.”

Even if ChatGPT did respect robots.txt, it’s not the only LLM-associated crawler. And some asshat creates a new generative AI brand seemingly every day. Maintaining your robots.txt would be interminable.

You can’t stop these crawlers. They vacuum up content with colonist zeal. So some folks have started experimenting with luring them, instead. That is, luring them into consuming tainted content, designed to contaminate their output and undermine their perceived efficacy.

Humans, for the most part, know gibberish when they see it. Even humans subjected, daily, to the AI-generated swill filling their social media feeds. To be on the safe side, you can even tell them, “this is gibberish, don’t read it.” A crawler would be none the wiser. Crawlers themselves don’t actually read and understand instructions in the way we do.

But distinguishing between LLM-associated crawlers and less nefarious crawlers like Googlebot is somewhat harder. Especially since it’s in the interest of bad actors to disguise themselves as Googlebot.

According to Google, it’s possible to verify Googlebot by matching the crawler’s IP against a list of published Googlebot IP ranges. This is rather technical and resource-intensive. And how one would actually use this information to divert crawlers is a whole other question.
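If you fancy trying it anyway, here is a rough sketch in Node. It assumes the shape of the googlebot.json list Google publishes (objects under prefixes, each with an ipv4Prefix or ipv6Prefix key), handles IPv4 only, and is illustrative rather than production-ready:

const RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json";

// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer
const toInt = (ip) => ip.split(".").reduce((n, octet) => (n << 8) + Number(octet), 0) >>> 0;

// Check whether an IPv4 address falls inside a CIDR range like "66.249.64.0/27"
function inCidr(ip, cidr) {
  const [base, bits] = cidr.split("/");
  const mask = bits === "0" ? 0 : (~0 << (32 - Number(bits))) >>> 0;
  return (toInt(ip) & mask) === (toInt(base) & mask);
}

// True if the visitor's IP appears in Google's published Googlebot ranges
async function isGooglebot(ip) {
  const { prefixes } = await (await fetch(RANGES_URL)).json();
  return prefixes
    .filter((p) => p.ipv4Prefix) // skip the IPv6 entries in this sketch
    .some((p) => inCidr(ip, p.ipv4Prefix));
}

Google also documents a reverse-then-forward DNS lookup as an alternative to the IP list, which is no less of a faff.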

So, what else can we use?

It’s a leap of faith, but we can probably assume Googlebot will respect the nofollow rule for hyperlinks. It’s not really in the interest of a search engine to contaminate its index with content not endorsed by its own author. By the same token, we can rely on LLM crawlers to ignore the nofollow rule to “own the libs” and extract what their colonist creators believe is rightfully theirs to take.

With this in mind, I have begun publishing corrupted versions of my articles, accessible only via nofollow links like the one included in the preface of this article. It won’t stop the crawlers from reading the canonical article, you understand, but it serves them a side dish of raw chicken and slug pellets, on the house.
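For illustration, that preface link boils down to markup like this (assuming a /nonsense/poisoning-well/ slug for this article’s own mirror):

<a href="/nonsense/poisoning-well/" rel="nofollow">the nonsense version</a>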

Theoretically, this approach will dupe bad actor crawlers and poison the LLMs they work for, but without destroying my search ranking. I’ll be keeping an eye on my high-ranking What Is Utility-First CSS article to see if it drops.

I’m not clear on what kind of content is best for messing with an LLM’s head, but I’ve filled these /nonsense mirrors with grammatical distortions and lexical absurdities. Since the parts-of-speech module I’m using doesn’t quite work as expected (substituting not just words for words but parts of words for words), there are also weird spelling errors. For once, I think this may be a good thing.

The output reads kind of like Geoffrey Chaucer, if Geoffrey Chaucer was a tech bro with a serious head injury.

For those interested in implementing something similar, here is what I did to my 11ty-based site:

  1. Created a nonsense.njk template that paginates over my main articles collection, mirroring each article to a /nonsense/* URL (a template sketch follows this list).
  2. Used an 11ty transform and JSDOM to manipulate selected text elements within each /nonsense/* document (see the transform sketch below).
  3. Substituted nouns, adverbs, verbs, adjectives, and expressions with random counterparts maintained in a words.json file.
  4. Created a preface section at the top of each canonical article, containing the rel="nofollow" link to the nonsense alternative.
  5. Added <meta name="robots" content="noindex, nofollow"> to each nonsense page (since people might link directly to these from elsewhere).
  6. Replaced the href of each link inside each /nonsense/* page with a link to another nonsense page (with a view to trapping crawlers in a matrix of nonsense content). This is based on a suggestion by @Blort@social.tchncs.de.
  7. Added a robots.txt rule to block Googlebot from /nonsense/* (the rule is shown below). In Google’s own words: “[Genuine Googlebot crawlers] always respect robots.txt rules for automatic crawls.”
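For step 1, the nonsense.njk front matter might look something like this. The articles collection name is an assumption on my part, and layout plumbing (including the step 5 meta tag) is left out; everything else is standard 11ty pagination:

---
pagination:
  data: collections.articles
  size: 1
  alias: article
permalink: "/nonsense/{{ article.fileSlug }}/"
eleventyExcludeFromCollections: true
---
{{ article.templateContent | safe }}

The eleventyExcludeFromCollections flag stops the mirrors from polluting the very collection they paginate over.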
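Steps 2, 3, and 6 live in the 11ty config. Here is a heavily simplified sketch: crude random noun substitution stands in for the parts-of-speech module, and a hypothetical slugs array in words.json stands in for picking real nonsense pages. Don’t mistake it for my actual code:

const { JSDOM } = require("jsdom");
const words = require("./words.json"); // e.g. { "nouns": [...], "slugs": [...] }

const pick = (list) => list[Math.floor(Math.random() * list.length)];

module.exports = (eleventyConfig) => {
  eleventyConfig.addTransform("nonsense", (content, outputPath) => {
    // Leave everything except the mirrored pages alone
    if (!outputPath || !outputPath.includes("/nonsense/")) return content;

    const dom = new JSDOM(content);
    const doc = dom.window.document;

    // Garble every text node, preserving the markup around it
    // (the real version substitutes by part of speech)
    const root = doc.querySelector("main") || doc.body;
    const walker = doc.createTreeWalker(root, dom.window.NodeFilter.SHOW_TEXT);
    let node;
    while ((node = walker.nextNode())) {
      node.textContent = node.textContent
        .split(" ")
        .map((word) => (word && Math.random() < 0.5 ? pick(words.nouns) : word))
        .join(" ");
    }

    // Step 6: every link leads deeper into the nonsense matrix
    doc.querySelectorAll("a[href]").forEach((a) => {
      a.setAttribute("href", `/nonsense/${pick(words.slugs)}/`);
    });

    return dom.serialize();
  });
};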
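And the step 7 rule is just the earlier snippet pointed at a different crawler:

User-agent: Googlebot
Disallow: /nonsense/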

Note that, unlike Cloudflare, I am not using AI to create my AI slug pellets. That defeats the whole premise. Instead, it’s just word substitutions based on a static lexicon.

Raising my own middle finger to LLM manufacturers will achieve little on its own. If doing this even works at all. But if lots of writers put something similar in place, I wonder what the effect would be. Maybe we would start seeing more—and more obvious—gibberish emerging in generative AI output. Perhaps LLM owners would start to think twice about disrespecting the nofollow protocol.

One can hope. At the very least, we’d all be depleting LLM crawler resources.

P.S. If you know a lot about crawler and LLM behaviors/architectures and can help improve the approach I’ve adopted, do reach out. I am not a computer science major or AI specialist.

