Since I put a bunch of ZIM files online at zim.pollux.casa, many bots have been crawling them. I don't understand why they bother scanning my Wikipedia copies; the original sites are certainly more efficient.

Some clearly identify themselves in their user agent, but others try to hide behind bogus user agent strings...

Here are some examples:

Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/4.0.212.0 Safari/532.0

Mozilla/5.0 (Windows NT 6.1; rv:22.0) Gecko/20130405 Firefox/22.0

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/10.10 Chromium/17.0.963.65 Chrome/17.0.963.65 Safari/535.11

Who is still using Chrome 4.0, Firefox 22.0, or Ubuntu 10.10???

Unfortunately, kiwix-serve (which serves the ZIM files on my machine) does not provide a robots.txt to deter these crawlers, so I had to block access at the web server level (lighttpd) based on the user agent string.

$HTTP["useragent"] =~ "(?i)spider|tiktokspider|claudebot|googlebot|meta-external|scrapy|sogou|petalbot|dotbot|mj12bot|crawl|bingbot|yandex|baidu|duckduckbot|facebook|amazon|grok|facebot|slurp|exabot|ahrefs|semrush|perplexity|gptbot|chatgpt|ccbot" {
    url.access-deny = ("")
}
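For the crawlers that do behave, lighttpd can also answer /robots.txt itself in front of kiwix-serve. A minimal sketch, assuming kiwix-serve is proxied locally on port 8080 (the paths and port are hypothetical):

```
# Proxy everything to kiwix-serve...
proxy.server = ( "" => ( ( "host" => "127.0.0.1", "port" => 8080 ) ) )

# ...except /robots.txt, served as a static file instead
$HTTP["url"] == "/robots.txt" {
    proxy.server = ()                          # disable proxying for this path
    server.document-root = "/var/www/static"   # contains robots.txt with "User-agent: *" / "Disallow: /"
}
```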

but crawlers with spoofed user agents keep coming 😐​
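The deny rule only catches bots that declare themselves; the spoofed browser strings above sail straight through it. A quick check in Python (the GPTBot string is illustrative, not a captured log line):

```python
import re

# Same case-insensitive pattern as the lighttpd rule
pattern = re.compile(
    r"spider|tiktokspider|claudebot|googlebot|meta-external|scrapy|sogou|"
    r"petalbot|dotbot|mj12bot|crawl|bingbot|yandex|baidu|duckduckbot|"
    r"facebook|amazon|grok|facebot|slurp|exabot|ahrefs|semrush|perplexity|"
    r"gptbot|chatgpt|ccbot",
    re.IGNORECASE,
)

honest_bot = "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
spoofed = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 "
           "(KHTML, like Gecko) Chrome/4.0.212.0 Safari/532.0")

print(bool(pattern.search(honest_bot)))  # True: the bot declares itself
print(bool(pattern.search(spoofed)))     # False: the fake browser UA slips through
```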