For the past couple of days, this “Willow Internet Crawler by Twotrees V2.1” has been aggressively crawling my site. And I mean aggressively – they download every single page as quickly as they can, with no pause between requests. This is a bit of a pain, because it means they are sucking down bandwidth that I’d rather use for live human beings or better-behaved applications.
But today was the last straw. I have a robots.txt file because when web crawlers hit my image gallery, they tend to cause errors in the PHP code, and those errors get logged in /var/log/messages. Today I noticed a “Last message repeated 147 times” message scrolling by, so I looked, and sure enough, “Willow Internet Crawler” isn’t obeying the spider guidelines – they haven’t even looked at my robots.txt.
The first thing I did was go to their web site, where I discovered that under “Contact Us”, you can only see their email address while your mouse is hovering over the title – once you move the cursor away to actually type in a mail program, it goes away again. And the address doesn’t appear in the same place as the title you are hovering over, which makes it (probably purposely) difficult to cut and paste the address into mutt.
So fine, you want to be an asshole? I can be an asshole too. I opened up /etc/httpd/conf/httpd.conf, found the “allow all” line, and added a “deny 22.214.171.124” after it, restarted the web server, and now I’m watching “Willow Internet Crawler” get a lot of 403s. So fuck you, Twotrees.net, and the horse you rode in on too.
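For anyone who wants to do the same, here is roughly what that change looks like. This is a sketch from memory rather than a copy of my actual config: the directory path is an assumption (yours may differ), and it uses the older Apache 2.2-style access directives that match the “allow” and “deny” lines described above.

```apache
# Hypothetical example; the <Directory> path is an assumption, adjust to your docroot.
<Directory "/var/www/html">
    # Process Allow rules first, then Deny rules; a matching Deny wins.
    Order allow,deny
    # The existing line that lets everyone in.
    Allow from all
    # The new line: requests from this address now get a 403.
    Deny from 22.214.171.124
</Directory>
```

After editing, the web server has to be restarted (or reloaded) for the change to take effect. On Apache 2.4 these directives are deprecated; the rough equivalent is a `<RequireAll>` block containing `Require all granted` and `Require not ip` followed by the address.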