Block non-human crawlers with lighttpd

2025-04-20 19:05

Recently, I put a copy of some ZIM files online with kiwix-serve. I posted the URL of the site on the Fediverse and, a few days later, the little server was a bit overloaded. The logs showed that the site was being crawled by search engines and AI training bots. There was no reason to let them. A robots.txt file calmed some of them, but not the others.
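
For reference, a robots.txt that asks every crawler to stay away looks like this; well-behaved bots honour it, the others simply ignore it:

    User-agent: *
    Disallow: /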

Analysing user agents and IP addresses is not the answer, because everything is done to make that difficult (randomised user agents, requests from many datacentre origins). I thought about Cloudflare protection, a Google captcha, or the open-source solution Anubis, but all of them require JavaScript to be enabled in the visitor's browser.

After several tests, I found a simple method to stop these crawlers.

The principle

When a request arrives, the web server checks whether it carries a cookie. If it does not, the server redirects the browser to an HTML form that asks the user to tick a checkbox and submit it. If the user submits the form correctly, he or she receives a cookie and is redirected to the page originally requested. That new request carries the cookie, so the web server does its job and sends the expected content.
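
The full configuration is on my blog (link below); what follows is only a rough sketch of how this could be wired up with lighttpd's mod_magnet module. The cookie name "human", the pages /check.html and /checked, and the script path are placeholders chosen for the example, and URL-encoding of the remembered address is left out. First, route every request through a small Lua script:

    # lighttpd.conf -- pass every request through a Lua check
    server.modules += ( "mod_magnet" )
    magnet.attract-raw-url-to = ( "/etc/lighttpd/human-check.lua" )

The script itself, using the classic mod_magnet API:

    -- /etc/lighttpd/human-check.lua
    -- All names here are illustrative placeholders.
    local path   = lighty.env["uri.path"]
    local query  = lighty.env["uri.query"] or ""
    local cookie = lighty.request["Cookie"] or ""

    -- Always serve the check form itself.
    if path == "/check.html" then return end

    -- Form submitted with the box ticked: hand out the cookie
    -- and send the browser back where it wanted to go.
    if path == "/checked" then
      if string.find(query, "iamhuman=on", 1, true) then
        local back = string.match(cookie, "back=([^;]*)") or "/"
        lighty.header["Set-Cookie"] = "human=1; Path=/"
        lighty.header["Location"]   = back
      else
        lighty.header["Location"] = "/check.html"
      end
      return 302
    end

    -- Cookie present: let the request through.
    if string.find(cookie, "human=1", 1, true) then return end

    -- No cookie: remember the requested address in a cookie
    -- (URL-encoding omitted in this sketch) and show the form.
    lighty.header["Set-Cookie"] = "back=" .. lighty.env["request.uri"] .. "; Path=/"
    lighty.header["Location"]   = "/check.html"
    return 302

The form is plain HTML, which is the point: nothing on the visitor's side requires JavaScript. Something like:

    <!-- /check.html (illustrative) -->
    <form method="get" action="/checked">
      <label><input type="checkbox" name="iamhuman"> I am human</label>
      <button type="submit">Continue</button>
    </form>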

Details on my blog
