Out, Damned Bot! Or, Securing Apache From Spiders and Flies

I Don't Like Spiders and Flies

  • February 6, 2009
  • By Ken Coar

The introduction of almost any technology is closely followed by attempts to figure out how to abuse it. As the technology matures, the methods of abuse become more and more sophisticated.

The Web is no different; almost as soon as people started publishing content, others began trying to figure out how to steal it. I'll call these people and their ilk 'perps.' As soon as pages became read/write instead of just read-only, perps began figuring out how to use them to publish their own content on other people's servers. (Examples include wikis and blog comments.)

Describing ways of dealing with such abuses is the end goal of this series of articles, but I'm going to cover some more basic issues first.

Spiders and Flies

The tools and robots that crawl the Web looking for content (for whatever reason) are frequently called 'spiders,' or sometimes 'bots.' Some spiders are good, such as the Google bot, which loads the Google search engine with what it finds. Others have a much more questionable goodness quotient, such as those that search Web pages for e-mail addresses to add to spam lists, or look for trademark references so that the information can be sold to the trademark holders for possible lawsuits.

While the term spider is in common use, I've never heard anyone give a name to the other type of abuse: hijacking writable Web pages such as blog comments and wikis. I'm going to coin the term 'flies' for abusive tools of this type, since they cluster around and crawl all over pages, leaving flyspecks and crap on them.

Abuses can be handled either proactively or reactively, or I suppose there's the third option of 'not at all.'

Proactive measures include SSL, user memberships, credential-protected pages, and scrutiny of submitted content (called 'moderation') before acceptance. As usual, a common result is that innocent users suffer because of the bad behavior of the perps, having to jump through hoops, click through multiple pages, and solve CAPTCHA challenges. (You know, the obfuscated images of warped words that you have to type in to prove you're not a bot.)
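As an illustration of the credential-protection approach, here is a minimal sketch in Apache 2.2-style configuration that puts a writable area behind a Basic-auth login (assuming the standard authentication modules are loaded). The directory path and password-file location are placeholders, not taken from any real site.

    # Require a valid login before anyone can reach the comment-posting area.
    # The paths below are placeholders for illustration only.
    <Directory "/var/www/site/comments">
        AuthType Basic
        AuthName "Registered users only"
        # Password file created beforehand with the htpasswd utility
        AuthUserFile /usr/local/apache2/conf/htpasswd-members
        Require valid-user
    </Directory>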

Handling abuses reactively usually means that you detect when someone misbehaves and enact restrictions that will prevent it from happening again. Doing this correctly can be an art, since making the conditions too narrow will let similar-but-not-identical abuses get through, while making them too broad can lock out legitimate visitors.
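As a sketch of the reactive style, the Apache 2.2-style configuration below shuts a client out once it has been identified by its User-Agent string. The name 'NastyCrawler' and the directory path are invented for the example; the point is simply that a condition observed in the logs gets turned into a rule.

    # Tag requests from an identified abuser, then refuse them.
    # "NastyCrawler" is a made-up name used only for illustration.
    SetEnvIfNoCase User-Agent "NastyCrawler" bad_bot
    <Directory "/var/www/site">
        Order Allow,Deny
        Allow from all
        # Deny overrides Allow for anything tagged bad_bot
        Deny from env=bad_bot
    </Directory>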

The Bouncer

When the toxic spider problem first surfaced, it took the form of simply gathering too much information (and thereby occasionally affecting server performance). It wasn't long before a solution appeared: the Robot Exclusion Standard (RES). It described the format of a file called 'robots.txt' that you could put on your site and that would indicate which areas of your site were available for crawling and which were not.
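For reference, a robots.txt file is just a plain-text series of rules, one block per spider (or one block covering all of them). A small example, with placeholder paths, might look like this:

    # Served as http://www.example.com/robots.txt
    # This block applies to every spider that bothers to read the file
    User-agent: *
    Disallow: /private/
    Disallow: /cgi-bin/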

The RES is intended to stand at the door of your site and control access by the paparazzi, er, spiders. The idea was that legitimate bots would check for the file and obey its restrictions. Of course, it doesn't directly help block those which don't check the file, but it was quickly adopted by the Spiders in the White Hats and has become a fixture of today's Web.

However, a standard like this is a little like a traffic signal; it only works when people agree to abide by the rules. Spiders that don't abide by the rules can often cause crashes.

With a little cleverness we can use the toxic spiders' RES non-compliance against them. To flog the analogy a little bit more, note that some municipalities have installed cameras to take photographs of malefactors who break the traffic laws. (Okay, let's not push the analogy too far.)

We're going to do something a little bit like that to deal with these naughty bots. Consider these possibilities, listed in order of increasing nastiness:

  1. The spider checks for robots.txt, and doesn't crawl prohibited areas. (Good bot! Here, have a cookie.)

  2. The spider checks for robots.txt, but doesn't comply with the restrictions.

  3. The spider doesn't even bother to check for robots.txt at all.

  4. The spider reads robots.txt, scans for 'allow' stanzas[1] that apply to other spiders, and then masquerades as those in order to access the protected areas.

  5. The spider reads robots.txt and explicitly tries to scan prohibited areas.

The first case covers the Spiders in the White Hats, so we won't worry about it. Handling the others requires applying some intelligence to the process, which means recording what a particular bot is doing and making decisions based on its activities.
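One simple way to start recording that activity is a trap: list a decoy directory as off-limits in robots.txt, and then watch for anything that requests it anyway. The sketch below, again in Apache 2.2-style configuration, logs such requests to a separate file and turns them away. The /bot-trap/ path and the log name are inventions for illustration, and the 'combined' log format is assumed to be defined in the usual way.

    # /bot-trap/ is Disallowed in robots.txt, so no compliant spider
    # should ever ask for it. Anything that does gets tagged...
    SetEnvIf Request_URI "^/bot-trap/" trapped_bot
    # ...logged separately for later review...
    CustomLog logs/trapped-bots.log combined env=trapped_bot
    # ...and refused outright.
    <Directory "/var/www/site/bot-trap">
        Order Allow,Deny
        Deny from all
    </Directory>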

[1] The original RES didn't support 'allow' stanzas, and not all RES-compliant bots recognize them. However, the basic issue is the same even for 'disallow' stanzas: a bot with evil intentions can conceivably change its access by pretending to be one of those for which you have explicit rules.
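To illustrate the point of the footnote, here is the sort of robots.txt that hands an ill-behaved spider a list of identities worth impersonating. The spider name and paths are invented.

    # A compliant 'FriendlyIndexer' may fetch the articles; everyone else may not.
    # An evil bot reading this file learns exactly which User-Agent to fake.
    User-agent: FriendlyIndexer
    Allow: /articles/
    Disallow: /

    User-agent: *
    Disallow: /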
