Block non-human crawlers with lighttpd

2025-04-20 19:05

Recently, I put a copy of some ZIM files online with kiwix-serve. I posted the URL of the site on the Fediverse and, a few days later, the little server was a bit overloaded. The logs showed that the site was being crawled by search engines and AI training bots. There was no reason to let them. A robots.txt file calmed some of them, but not the others.
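
For reference, a robots.txt that asks every crawler to stay away looks like this; well-behaved bots honour it, the others simply ignore it:

    User-agent: *
    Disallow: /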

Analysing user agents and IP addresses is not the answer, because everything is done to make that difficult (randomised user agents, requests from many datacentre origins). I thought about Cloudflare protection, a Google captcha, or the open-source solution Anubis, but all of them require JavaScript to be enabled in the visitor's browser.

After several tests, I found a simple method to stop these crawlers.

The principle

When a request arrives, the web server checks whether it carries a cookie. If it does not, the server redirects the browser to an HTML form that asks the user to tick a checkbox and submit it. If the user submits the form correctly, he or she receives a cookie and is redirected to the page originally requested. That new request carries the cookie, so the web server does its job and sends the expected content.
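
The full configuration is on my blog (link below); what follows is only a rough sketch of how this could be wired up with lighttpd's mod_magnet module. The cookie name "human", the pages /check.html and /checked, and the script path are placeholders chosen for the example, and URL-encoding of the remembered address is left out. First, route every request through a small Lua script:

    # lighttpd.conf -- pass every request through a Lua check
    server.modules += ( "mod_magnet" )
    magnet.attract-raw-url-to = ( "/etc/lighttpd/human-check.lua" )

The script itself, using the classic mod_magnet API:

    -- /etc/lighttpd/human-check.lua
    -- All names here are illustrative placeholders.
    local path   = lighty.env["uri.path"]
    local query  = lighty.env["uri.query"] or ""
    local cookie = lighty.request["Cookie"] or ""

    -- Always serve the check form itself.
    if path == "/check.html" then return end

    -- Form submitted with the box ticked: hand out the cookie
    -- and send the browser back where it wanted to go.
    if path == "/checked" then
      if string.find(query, "iamhuman=on", 1, true) then
        local back = string.match(cookie, "back=([^;]*)") or "/"
        lighty.header["Set-Cookie"] = "human=1; Path=/"
        lighty.header["Location"]   = back
      else
        lighty.header["Location"] = "/check.html"
      end
      return 302
    end

    -- Cookie present: let the request through.
    if string.find(cookie, "human=1", 1, true) then return end

    -- No cookie: remember the requested address in a cookie
    -- (URL-encoding omitted in this sketch) and show the form.
    lighty.header["Set-Cookie"] = "back=" .. lighty.env["request.uri"] .. "; Path=/"
    lighty.header["Location"]   = "/check.html"
    return 302

The form is plain HTML, which is the point: nothing on the visitor's side requires JavaScript. Something like:

    <!-- /check.html (illustrative) -->
    <form method="get" action="/checked">
      <label><input type="checkbox" name="iamhuman"> I am human</label>
      <button type="submit">Continue</button>
    </form>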

Details on my blog
