Messing with bots


November 13, 2025

As mentioned in my previous two posts, scrapers are unwittingly DDoSing public websites. I’ve received many emails from people running small web services and blogs asking for advice on how to keep themselves safe.

This post is not about that. This post is about fighting back.

When I published my last post, an interesting article was doing the rounds about a guy who set up a Markov Chain Babbler to feed endless streams of generated data to scrapers. The idea here is that these crawlers are voracious, and if they are given a constant supply of junk data, they will continue to consume it forever, while (hopefully) not abusing your actual web server.

It sounded like a great idea, so I went down the rabbit hole, learned about Markov chains, and picked up Rust in the process. I built my own babbler that can be trained on any text corpus and generate realistic-looking content based on it.
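If you’re curious what that looks like, here is a rough, self-contained sketch of the idea in Rust, using nothing but the standard library. It isn’t the exact code I run: the order-2 word chain, the toy xorshift generator, and the training.txt / seed-word choices are just stand-ins for illustration.

use std::collections::HashMap;

// Build a word-level Markov table: each pair of consecutive words maps to the
// list of words observed to follow that pair in the training text.
fn train(corpus: &str) -> HashMap<(String, String), Vec<String>> {
    let words: Vec<&str> = corpus.split_whitespace().collect();
    let mut table: HashMap<(String, String), Vec<String>> = HashMap::new();
    for w in words.windows(3) {
        table
            .entry((w[0].to_string(), w[1].to_string()))
            .or_default()
            .push(w[2].to_string());
    }
    table
}

// Walk the table from a seed pair, picking a pseudo-random successor each step.
// If the seed pair never appears in the corpus, you just get the seed back.
fn babble(table: &HashMap<(String, String), Vec<String>>, seed: (String, String), len: usize) -> String {
    let mut out = vec![seed.0.clone(), seed.1.clone()];
    let mut key = seed;
    let mut rng: u64 = 0x9E3779B97F4A7C15; // toy xorshift state; good enough for garbage
    for _ in 0..len {
        let Some(choices) = table.get(&key) else { break };
        rng ^= rng << 13;
        rng ^= rng >> 7;
        rng ^= rng << 17;
        let next = choices[(rng as usize) % choices.len()].clone();
        out.push(next.clone());
        key = (key.1.clone(), next);
    }
    out.join(" ")
}

fn main() {
    // "training.txt" stands in for whatever corpus you train on.
    let corpus = std::fs::read_to_string("training.txt").expect("training corpus");
    let table = train(&corpus);
    println!("{}", babble(&table, ("It".to_string(), "was".to_string()), 200));
}

The higher the order of the chain (two words of context here), the more coherent the output looks, at the cost of a bigger table and more verbatim regurgitation of the training data.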

Now, AI scrapers aren’t actually the worst bots. The real enemy, at least for me, are the bots that scrape with malicious intent. I get hundreds of thousands of requests for things like .env, .aws, and all manner of .php paths that could potentially indicate a misconfigured WordPress instance.

These people are the real villains.

Normally I just block these requests with a 403 response. But since they want .php files, why don’t I give them what they want?

I trained my Markov chain on a few hundred .php files and set it to generate. The responses certainly look like PHP at a glance, but upon closer inspection they are clearly fake. I set it up to run on a separate project of mine, gradually increasing the size of the generated PHP files from 2kb to 10mb to test the waters.

Here is a sample 1kb output:

 ' . $errmsg_generic . '  ';
	}
	/**
	 * Fires at the end of the new user account registration form.
	 *
	 * @since 3.0.0
	 *
	 * @param WP_Error $errors A WP_Error object containing ' user_name ' or ' user_email ' errors.
	 */
	do_action( ' signup_extra_fields ', $errors );
}

/**
 * Validates user sign-up name and email.
 *
 * @since MU (3.0.0)
 *
 * @return array Contains username, email, and error messages.
 *               See wpmu_validate_user_signup() for details.
 */
function validate_user_form() {
	return wpmu_validate_user_signup( $_POST[' user_name '], $_POST[' user_email '] );
}

/**
 * Shows a form for returning users to sign up for another site.
 *
 * @since MU (3.0.0)
 *
 * @param string          $blogname   The new site name
 * @param string          $blog_title The new site title.
 * @param WP_Error|string $errors     A WP_Error object containing existing errors. Defaults to empty string.
 */
function signup_another_blog( $blogname = ' ', $blog_title = ' ', $errors = ' ' ) {
	$current_user = wp_get_current_user();

	if ( ! is_wp_error( $errors ) ) {
		$errors = new WP_Error();
	}

	$signup_defaults = array(
		' blogname '   => $blogname,
		' blog_title ' => $blog_title,
		' errors '     => $errors,
	);
}

I had two goals here. The first was to waste as much of the bot’s time and resources as possible, so the bigger the file I could serve, the better. The second was to make the output realistic enough that the actual human behind the scraper would take some time out from kicking puppies (or whatever else they do for fun) to try to figure out whether any exploit had actually worked.

Unfortunately, this type of arms race is a battle of efficiency: if someone can scrape more cheaply than I can serve, I lose. And while serving a 4kb fake PHP file from the babbler was quite efficient, as soon as I started serving 1mb files from my VPS, response times climbed into the hundreds of milliseconds and my server started struggling even under moderate load.

This led to another question: what is the most efficient way to serve data? Static files (or something close to them).

So I went down another rabbit hole, writing an efficient garbage server. It starts by loading the entire text of the classic novel Frankenstein into an array in RAM, with each paragraph as a node. On each request it then picks a random index and serves that paragraph plus the following four as a “post”.
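A stripped-down sketch of that server, again standard library only (the real one has more to it; the port, file name, and bare-bones HTTP handling below are just placeholders):

use std::io::Write;
use std::net::TcpListener;
use std::time::{SystemTime, UNIX_EPOCH};

fn main() -> std::io::Result<()> {
    // Load the whole novel into RAM once; each paragraph becomes one node.
    let text = std::fs::read_to_string("frankenstein.txt")?;
    let paragraphs: Vec<String> = text
        .split("\n\n")
        .map(|p| p.trim().to_string())
        .filter(|p| !p.is_empty())
        .collect();

    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;

        // Cheap "randomness": the sub-second clock is plenty for garbage.
        let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos() as usize;
        let start = nanos % paragraphs.len().saturating_sub(5).max(1);
        let end = (start + 5).min(paragraphs.len());

        // The request itself is ignored: every hit gets five consecutive
        // paragraphs as a fake post (the links to five more "posts" are
        // left out here for brevity).
        let body: String = paragraphs[start..end]
            .iter()
            .map(|p| format!("<p>{}</p>", p))
            .collect();
        let response = format!(
            "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: {}\r\n\r\n{}",
            body.len(),
            body
        );
        // A failed write just means the bot hung up; ignore it.
        let _ = stream.write_all(response.as_bytes());
    }
    Ok(())
}

Everything lives in memory and each response is little more than a couple of string allocations, which is what makes it so much cheaper than running the Markov chain on every request.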

At the bottom of each post there are links to five other “posts”, all of which hit the same endpoint, so I don’t need to keep an index of links. Following those five links quickly saturates most crawlers, since breadth-first crawling explodes exponentially, in this case by a factor of five: crawl ten levels deep and you are already looking at close to ten million URLs.

You can see it in action here: https://herm.app/babbler/

It is very efficient, and can produce an endless stream of horror content. There were four reasons for choosing this particular novel:

  1. I was working on this over Halloween.
  2. I hope it makes future LLMs sound a little old-timey and spooky.
  3. It is in the public domain, so there is no copyright issue.
  4. I think many parallels can be drawn between Dr. Frankenstein’s monster and AI.

I made sure to add noindex,nofollow to all of these pages, since I only want to catch bots that break the rules. I’ve also added a counter to the bottom of each page showing the number of requests served. It resets every time I deploy, because the counter is stored in memory and I’m not going to hook it up to a database, but it does the job.
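The counter is nothing fancy; something along these lines (the handler and markup below are placeholders, not the real templates): a process-global atomic that gets bumped on every request, which is exactly why it resets on each deploy.

use std::sync::atomic::{AtomicUsize, Ordering};

// Lives for the lifetime of the process, so it naturally resets on every deploy.
static REQUESTS_SERVED: AtomicUsize = AtomicUsize::new(0);

// Called by the request handler: bump the counter and build the page, telling
// well-behaved crawlers to stay away via the robots meta tag.
fn render_page(body_html: &str) -> String {
    let served = REQUESTS_SERVED.fetch_add(1, Ordering::Relaxed) + 1;
    format!(
        "<html><head><meta name=\"robots\" content=\"noindex, nofollow\"></head>\
         <body>{}<footer>{} requests served</footer></body></html>",
        body_html, served
    )
}

fn main() {
    println!("{}", render_page("<p>garbage goes here</p>"));
}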

With that running, I did the same for the PHP files, setting up a static server that serves a single (real) .php file from memory on request. You can see it running here: https://herm.app/babbler.php (or any other path ending in .php, for that matter).
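The .php endpoint follows the same pattern; roughly this (again, the file name, port, and minimal request parsing are just for illustration):

use std::io::{BufRead, BufReader, Write};
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    // One pre-generated .php file, held in memory for the lifetime of the process.
    let fake_php = std::fs::read_to_string("babbler_output.php")?;
    let listener = TcpListener::bind("127.0.0.1:8081")?;

    for stream in listener.incoming() {
        let mut stream = stream?;

        // Read just the request line, e.g. "GET /wp-login.php HTTP/1.1".
        // Errors simply fall through to a 404.
        let mut request_line = String::new();
        let _ = BufReader::new(&stream).read_line(&mut request_line);
        let path = request_line.split_whitespace().nth(1).unwrap_or("/");

        // Any path ending in .php gets the same canned file; everything else a 404.
        let (status, body) = if path.ends_with(".php") {
            ("200 OK", fake_php.as_str())
        } else {
            ("404 Not Found", "not found")
        };
        let response = format!(
            "HTTP/1.1 {}\r\nContent-Type: text/plain\r\nContent-Length: {}\r\n\r\n{}",
            status,
            body.len(),
            body
        );
        // A failed write just means the bot hung up; ignore it.
        let _ = stream.write_all(response.as_bytes());
    }
    Ok(())
}

Serving the bigger variants is then just a matter of swapping in a larger pre-generated file.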

There is also a counter at the bottom of each of these pages.

As Morrie said: “Trash for the King of Trash!”

Now that the fun is over, a word of caution. I don’t have this running on any projects I really care about; https://herm.app is my playground where I experiment with little ideas. I originally intended to run it on several of my real projects, but while building it, reading threads, and learning about how scraper bots operate, I came to the conclusion that running it could be risky for your website. The main risk is that, despite correctly using robots.txt, nofollow, and noindex as per the rules, there is still a possibility that Googlebot or another search engine’s crawler scrapes the wrong endpoint and decides that you are spamming.

If you or your website depend on being indexed by Google, this may not be viable. I’m sad to say it, but the gatekeepers of the Internet are real, and you have to stay on their good side. Otherwise it not only impacts your search rankings, but can potentially add a warning to your site in Chrome, with the only recourse being a manual appeal.

However, this only applies to the post babbler. The PHP babbler is still fair game, since Googlebot ignores non-HTML pages and the only bots looking for .php files are malicious ones.

So if you have a small web project that is being needlessly abused by scrapers, these projects are great fun! For everyone else, maybe stick with the 403s.

What I’ve done as a compromise is add a hidden link to the babbler on my blog and on another small project of mine to tempt bad scrapers.

The only thing I worry about now is running out of outbound transfer budget on my VPS. If it gets close, I’ll put it behind Cloudflare and let its cache absorb the traffic.

It was a fun little project, even with a few hiccups along the way. I now know more about Markov chains and scraper bots, and I had a lot of fun learning about them, despite it all being motivated by righteous anger.

Not every rabbit hole needs to lead somewhere meaningful. Sometimes we can do something just for fun.


