A few days ago, the small server hosting this website was temporarily taken down by scraping bots. This is not the first time that I have seriously considered employing more aggressive countermeasures like Anubis (see the June 2025 summary post for an example). But every time something like this happens, a part of the software hobbyist inside me dies. We should add this to the list of things that AI scrapers destroy, next to our environment, the creative enthusiasm of the individuals who created the things being scraped, and our critical thinking skills.
When I attempted to access Brain Baking, I encountered an unusual delay that prompted me to log in and see what was going on. A simple top revealed both Gitea and Fail2ban to be grabbing almost all the CPU resources. Uh oh. Killing Gitea did not immediately reduce Fail2ban’s work, as the Nginx access logs were filled with entries like:
47.79.216.157 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/links.md?display=source HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.217.151 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/rss/commit/5911666cf0b30236cdc7590abb4e171534faf972/content/museum.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.217.32 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/7b46fd682f36af81d4852b8ee2ee9970c638cac6/layouts HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.218.157 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/404.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.216.205 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/590574b17b0e1bb068d442d309341e98762fd55d/content/about.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.217.95 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/rss/commit/25674d6de08a667926aab89362fa7bb585cd35c5/content/links.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.218.191 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/590574b17b0e1bb068d442d309341e98762fd55d/themes HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.216.116 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/rss/commit/b4eac0fb71b056cb44fe062b8f2c0949dbb08af6/content/museum.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
I have enough security systems in place to stop bad bots, but the user agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36 is not immediately recognized as “bad”: it’s ridiculously easy to spoof that HTTP header. Most user agent checkers I threw this string at claim that this agent is not a bot, which means that we should not rely solely on this information.
Also, I temporarily block individual IPs that poke around (e.g. offenders of the Nginx rate limiting get pulled into the ban list), but of course these scrapers never come from the same source. Still, the base attacking IP range remained the same: 47.79. The website ipinfo.io can help identify the threat: AS45102 Alibaba (US) Technology Co., Ltd. Huh?
Apparently, Alibaba provides hosting from Singapore that is often abused by attackers. Many others hosting forum software like phpBB encountered similar problems, and although AbuseIPDB does not report recent issues on the IPs from the above logs, I went ahead and blocked the entire range.
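Spotting such a shared range by eye works for a handful of log lines, but a small helper makes it quicker. The following is a minimal sketch, not what was actually run on this server: it assumes the default Nginx combined log format with the client IP as the first field, only considers IPv4 addresses, and top_prefixes is a hypothetical function name.

```shell
# Count requests per /16 prefix (the first two octets of the client IP)
# in an Nginx access log read from stdin, busiest prefix first.
# Assumption: the client IP is the first whitespace-separated field.
# Lines whose first field is not a dotted IPv4 address are skipped.
top_prefixes() {
  awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
         split($1, octet, ".")
         print octet[1] "." octet[2]
       }' |
    sort |        # group identical prefixes together
    uniq -c |     # prefix each group with its request count
    sort -rn |    # highest count first
    awk '{ print $2, $1 }'   # output as "prefix count"
}

# Usage (the log path is an assumption):
#   top_prefixes < /var/log/nginx/access.log | head
```

A prefix that towers above the rest, like 47.79 in the excerpt above, is a good candidate to look up on ipinfo.io before deciding whether to block the whole range.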
Fail2ban was struggling to keep up: it ingests the Nginx access.log file to enforce its rules, but if the files keep exploding… Using the output of cat access.log | grep /commit/ | cut -d " " -f 1 to immediately ban everyone who tried to access Git’s commit log was not fast enough either. The only thing that had an immediate effect was sudo iptables -I INPUT -s 47.79.0.0/16 -j DROP.
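The grep and cut pipeline above can be tightened a little. This is a hedged sketch under a few assumptions: the log uses the default Nginx combined format (so the request path is the seventh whitespace-separated field), and commit_scraper_ips and print_drop_rules are hypothetical helper names. It only prints the iptables commands as a dry run, so nothing gets blocked until the output is reviewed.

```shell
# Print each unique client IP that requested a URL containing /commit/.
# Reads an Nginx access log from stdin.
# Assumption: combined log format, so $1 is the IP and $7 the request path.
commit_scraper_ips() {
  awk '$7 ~ /\/commit\// { print $1 }' | sort -u
}

# Turn a list of IPs on stdin into iptables DROP commands.
# This only echoes the commands (dry run); nothing is executed.
print_drop_rules() {
  while read -r ip; do
    echo "iptables -I INPUT -s $ip -j DROP"
  done
}

# Usage (the log path is an assumption):
#   commit_scraper_ips < /var/log/nginx/access.log | print_drop_rules
```

Matching on the path field instead of grepping the whole line avoids catching visitors whose referrer merely contains /commit/, and sort -u prevents the same address from being banned twice.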
If it wasn’t obvious by now: I hate dealing with this. It is a waste of time, it does not prevent the next attack coming from another range, and intervention always comes too late. But the worst part is that semi-random firefighting is just a big mood killer. I know this will not be enough. Having a stronger anti-attacker system in place can improve the odds, but that means either resorting to heavy artillery like Anubis or moving the entire hosting to Cloudflare, which will do it for me. But I don’t want to mess with even more dynamic components and configuration, nor do I want to route my visitors through a tracking-enabled USA server.
That Gitea instance should be moved off-site, or even better, I should move the migration to Codeberg to the top of my TODO list. Yet it’s sad to see that people who like to mess around with their own small servers are being penalized for doing so, pushing many people towards a centralized solution, making things worse in the long term. The Internet is no longer a safe haven for software enthusiasts. I can link to dozens of other bloggers who have reported similar issues to further solidify my point.
Another thing I’ve noticed is increased traffic with referrer headers pointing to strange websites such as bioware.com, mcdonalds.com, and microsoft.com. It’s not like any of these stalwarts are going to link to any article on this site, so I don’t understand what the purpose of spoofing that header is, other than increasing the number of hits?
No matter how bad the circumstances are, I refuse to give up.
It’s like 50 Cent said: Get Hostin’ or Die Tryin’.