
SpiderTrap


Quite often, for large sites like wikis that are generated dynamically, or for an OnlineCommunity that wants to remain more or less a HiddenCommunity?, WebCrawlers (aka spiders) are a big problem. While the RobotsExclusionStandard is supposed to protect a site against wayward spiders, not all spiders adhere to this standard, particularly those harvesting e-mail addresses for spam. Further, the RobotsExclusionStandard offers only a binary response: spiders are either allowed in or excluded altogether. Crafting a response particular to a given spider is not an option.

To respond to these spiders, one first needs to detect them. Note that spiders cannot read semantically: they only read links, and then follow links. If you present a link that only a spider will follow, hopefully the spider will follow the link and your users won't. In this way, it is machine-oriented fly-paper.
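To make the fly-paper idea concrete, here is a minimal sketch assuming a small Flask application; the /trap path and the in-memory banned_ips set are illustrative choices, not part of any particular wiki engine.

  # Minimal sketch of a SpiderTrap: any client that requests the trap URL
  # is treated as a spider and banned. The /trap path and banned_ips set
  # are assumptions for illustration only.
  from flask import Flask, request, abort

  app = Flask(__name__)
  banned_ips = set()

  @app.before_request
  def refuse_banned_clients():
      # Banned clients get nothing further from the site.
      if request.remote_addr in banned_ips:
          abort(403)

  @app.route("/trap")
  def spider_trap():
      # Only a spider blindly following links should ever arrive here.
      banned_ips.add(request.remote_addr)
      abort(403)

  @app.route("/")
  def page():
      # The trap link is emitted alongside normal content; how (or whether)
      # to hide it from human readers is discussed below.
      return '<p>Normal page content.</p><a href="/trap">Spider trap</a>'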

There are various ways of presenting the link so that spiders will follow it but users won't click on it.

It should be noted that Google hates people who hide links deviously like this on their sites. The simplest and best solution may simply be - a la HidingInPlainSight? and OpenProcess - to write a plain link <a href="http://...">Spider trap</a>, but then give curious cats a way out of the mess they have entered. On the next screen (or all the screens thereafter), offer opportunities for HumanVerification (e.g. a CAPTCHA).
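A sketch of that escape hatch, in the same illustrative Flask style as above: instead of banning outright, the trap page asks a simple question, and a correct answer lifts the ban. The arithmetic question is only a stand-in for a real CAPTCHA.

  # Variation on the earlier sketch: the trap offers HumanVerification
  # instead of an immediate, permanent ban. If a global ban check is used
  # (as above), the /trap route must stay reachable for banned clients.
  from flask import Flask, request

  app = Flask(__name__)
  banned_ips = set()

  @app.route("/trap", methods=["GET", "POST"])
  def spider_trap():
      ip = request.remote_addr
      if request.method == "POST" and request.form.get("answer", "").strip() == "4":
          banned_ips.discard(ip)   # a curious human found the way out
          return "Thanks, you are unbanned."
      banned_ips.add(ip)           # spiders never get past this point
      return ('<p>You followed a link meant only for robots.</p>'
              '<form method="post">What is 2 + 2? '
              '<input name="answer"><input type="submit"></form>')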

Alternatively, there are SpiderTraps designed for spam-bots which need not be hidden from users, such as a SpamPoisoner.

You can make a SpiderTrap link a SelfBan as well, if you want to block all spiders. This is somewhat dangerous, as unwitting users will routinely click on the URL, and it exposes the URL to would-be attackers. If you do this, you may have to take some extra precautions. You could generate a nonce every time you display the SpiderTrap-SelfBan link, and expire it quickly. Another option is to allow people to unban themselves with HumanVerification.
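One possible shape for the nonce precaution, again with assumed in-memory state; the lifetime and token length are arbitrary choices.

  # Each rendering of the SelfBan link embeds a fresh token that expires
  # quickly, so a copied or replayed trap URL does no harm.
  import secrets
  import time

  NONCE_LIFETIME = 60      # seconds; an arbitrary choice
  active_nonces = {}       # nonce -> time it was issued

  def make_selfban_link():
      nonce = secrets.token_urlsafe(16)
      active_nonces[nonce] = time.time()
      return '<a href="/trap?nonce=%s">Spider trap</a>' % nonce

  def nonce_is_valid(nonce):
      # Single-use: pop the nonce so it cannot be replayed.
      issued = active_nonces.pop(nonce, None)
      return issued is not None and time.time() - issued < NONCE_LIFETIME

  # In the trap handler, ban the client only when nonce_is_valid(...) is
  # true; stale or missing nonces are simply ignored.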

You can combine the SpiderTrap with a SurgeProtector. If the spider trips the trap more than X times a minute, or more than X percent of its total hits, you then block it. This can help detect misbehaving spiders (like a wget) without banning GoogleBot? or Yahoo! Slurp. Additionally, this can help overcome the problem of users banning themselves. If someone is dumb enough to continuously load the SpiderTrap, perhaps they deserve to be banned.
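A sketch of the combined trap-plus-SurgeProtector, with an assumed threshold of three trips per minute; only a client that keeps hitting the trap within the window gets banned.

  # Ban a client only once it has tripped the trap more than
  # TRAP_THRESHOLD times within a one-minute window. The threshold and
  # window are assumptions; tune them to your site.
  import time
  from collections import defaultdict, deque

  TRAP_THRESHOLD = 3       # trips per minute before banning
  WINDOW = 60              # seconds
  trap_hits = defaultdict(deque)
  banned_ips = set()

  def record_trap_hit(ip):
      now = time.time()
      hits = trap_hits[ip]
      hits.append(now)
      while hits and now - hits[0] > WINDOW:
          hits.popleft()               # forget hits outside the window
      if len(hits) > TRAP_THRESHOLD:
          banned_ips.add(ip)           # persistent offender: almost certainly a spider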

Warning: SearchEngineCloak detection algorithms are secret. They may simply use a second spider to check whether the pages are similar. If you SpiderTrap one and not the other, your site may be banned. The solution is to disallow the SpiderTrap link in RobotsDotTxt? - after all, you are looking for bots that do not adhere to the RobotsExclusionStandard.
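For completeness, one way to keep well-behaved crawlers away from the trap URL is to disallow it in robots.txt; a static file at the web root works just as well as serving it dynamically, as in this illustrative Flask route (the /trap path is the same assumption as above).

  # Serve a robots.txt that excludes the trap path, so only bots ignoring
  # the RobotsExclusionStandard ever see it.
  from flask import Flask

  app = Flask(__name__)

  @app.route("/robots.txt")
  def robots_txt():
      return ("User-agent: *\n"
              "Disallow: /trap\n"), 200, {"Content-Type": "text/plain"}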


CategoryHardSecurity CategorySpam

Discussion
