Block non-human crawlers with lighttpd
2025-04-20 19:05
Recently, I put a copy of some ZIM files online with kiwix-serve. I posted the URL of this site on the Fediverse and, a few days later, the little server was a bit overloaded. The logs showed that the site was being crawled by search engines and AI training bots. There was no reason to let them. A robots.txt file calmed some of them, but not others.
Analysing user agents and IP addresses is not the answer, because everything is done to make that complicated (randomised user agents, requests from many datacentre origins). I thought about Cloudflare protection, Google's CAPTCHA, and the open-source solution Anubis, but all of them require JavaScript to be enabled in the human visitor's browser.
After several tests, I have found a simple method to stop these crawlers.
The principle
When a request arrives, the web server checks whether it carries a cookie. If it does not, the server redirects the browser to an HTML form that asks the user to tick a checkbox and submit. If the user submits the form correctly, they receive a cookie and are redirected to the previously requested page. That new request carries the cookie, so the web server does its job and sends the expected content.
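In lighttpd, the cookie check can be expressed with conditionals and mod_redirect. The following is a minimal sketch, not the author's actual configuration: the cookie name `humancheck`, the `/challenge` path, and the `from` parameter are illustrative assumptions, and the handler behind `/challenge` (which validates the checkbox, sets the cookie, and redirects back) is not shown.

```
# Sketch for lighttpd.conf -- names and paths are assumptions.
server.modules += ( "mod_redirect" )

# Requests that do not carry the expected cookie...
$HTTP["cookie"] !~ "humancheck=ok" {
    # ...except for the challenge form itself (avoids a redirect loop)...
    $HTTP["url"] !~ "^/challenge" {
        # ...are redirected to the form, remembering the requested page.
        url.redirect = ( "^/(.*)" => "/challenge?from=/$1" )
    }
}
```

Once the form handler answers with a `Set-Cookie: humancheck=ok` header and a redirect to the `from` URL, subsequent requests fall outside the conditional and are served normally.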
@adele looks good, hope this solution will survive long...