Why you need to limit bots

The use of bots and scrapers continues to surge, and they are pounding your web server resources, whether in the cloud or not. Most bot behavior is benign, but poorly coded and malicious bots can hurt site speed and performance, and their traffic patterns can look like DDoS attacks. They may be part of a rival's competitive monitoring, or they may be repurposing your content and presenting it as their own.


Earlier this week Akamai wrote about a DDoS attack from a group claiming to be Lizard Squad.


Let’s examine how third-party content bots and scrapers are becoming more prevalent as developers seek to gather, store, sort and present massive amounts of data.


Meta-search services typically use APIs to access data, but many now use screen-scraping to collect information instead.


Using your .htaccess file, you can start limiting these leeches and mitigating their malicious traffic:


# forbid traffic from specific referrers 
RewriteEngine on 
RewriteCond %{HTTP_REFERER} blackhatworth\.com [NC,OR]
RewriteCond %{HTTP_REFERER} priceg\.com [NC] 
RewriteRule .* - [F]
# redirect any request for anything from spamsite to different spam site
# sends 'em right back and the harder they hit the harder you hit back

RewriteCond %{HTTP_REFERER} ^http://.*darodar.*$ [NC]
RewriteRule .* http://www.blackhatworth.com [R,L]
RewriteCond %{HTTP_REFERER} ^http://.*ilovevitaly.*$ [NC]
RewriteRule .* http://econom.co [R,L]
RewriteCond %{HTTP_REFERER} ^http://.*econom.*$ [NC]
RewriteRule .* http://ilovevitaly.co [R,L]
RewriteCond %{HTTP_REFERER} ^http://.*buttons-for-website.*$ [NC]
RewriteRule .* http://forum.topic54047091.darodar.com [R,L]
RewriteCond %{HTTP_REFERER} ^http://.*7makemoneyonline.*$ [NC]
RewriteRule .* http://buttons-for-website.com [R,L]
RewriteCond %{HTTP_REFERER} ^http://.*hulfington.*$ [NC]
RewriteRule .* http://priceg.com [R,L]
#scanner bot & malicious input blocker 
#courtesy of the 5G blacklist at perishablepress.com

<IfModule mod_rewrite.c>
RewriteCond %{HTTP_USER_AGENT} (%0A|%0D|%27|%3C|%3E|%00) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (;|<|>|'|"|\)|\(|%0A|%0D|%22|%27|%28|%3C|%3E|%00).*(libwww-perl|python|nikto|curl|scan|java|winhttp|HTTrack|clshttp|archiver|loader|email|harvest|extract|grab|miner) [NC,OR]
RewriteCond %{HTTP:Acunetix-Product} ^WVS [OR]
RewriteCond %{REQUEST_URI} (<|%3C)([^s]*s)+cript.*(>|%3E) [NC,OR]
RewriteCond %{REQUEST_URI} (<|%3C)([^e]*e)+mbed.*(>|%3E) [NC,OR]
RewriteCond %{REQUEST_URI} (<|%3C)([^o]*o)+bject.*(>|%3E) [NC,OR]
RewriteCond %{REQUEST_URI} (<|%3C)([^i]*i)+frame.*(>|%3E) [NC,OR]
RewriteCond %{REQUEST_URI} base64_(en|de)code[^(]*\([^)]*\) [NC,OR]
RewriteCond %{REQUEST_URI} (%0A|%0D|\\r|\\n) [NC,OR]
RewriteCond %{REQUEST_URI} union([^a]*a)+ll([^s]*s)+elect [NC]
RewriteRule .* - [F,L]
</IfModule>
# copy and paste these security mitigation tools for bad bots into your htaccess file


Understanding the different categories of third-party content bots, how they affect a website, and how to mitigate their impact is an important part of building a secure web presence.


Specifically, Akamai has seen bots and scrapers used for such purposes as:


  • Setting up fraudulent sites
  • Reuse of consumer price indices
  • Analysis of corporate financial statements
  • Metasearch engines
  • Search engines
  • Data mashups
  • Analysis of stock portfolios
  • Competitive intelligence
  • Location tracking

During 2014, there was a substantial observed increase in the aggregate traffic of bots and scrapers hitting the travel, hotel and hospitality sectors. The growth in nuisance spiders targeting these sectors is likely driven by the rise of rapidly developed mobile apps that use scrapers as the fastest and easiest way to collect information from many sites and reassemble it (pricing, availability, etc.) for the app user.

Scrapers target room-rate pages for hotels, and pricing and schedules for airlines. In many cases that were investigated, scrapers and bots made several thousand requests per second, far more than could be expected from a human using a web browser.


An interesting development in the use of headless browsers is the advent of companies that offer scraping as a service, such as PhantomJs Cloud. These sites make it easy for users to scrape content and have it delivered, lowering the barrier to entry and making it easier for unskilled individuals to scrape content while hiding behind a service.
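As a first line of defense, you can refuse requests whose User-Agent string still openly advertises a headless browser. This is a sketch only: the user-agent substrings below are illustrative examples, and determined scrapers will spoof a normal browser UA, so treat it as a filter for the laziest clients rather than a complete defense:

```apache
# Sketch: deny clients whose User-Agent admits to being a headless
# browser. The UA substrings are illustrative; sophisticated
# scrapers spoof regular browser strings and will slip past this.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (HeadlessChrome|PhantomJS|SlimerJS|Electron) [NC]
RewriteRule .* - [F,L]
</IfModule>
```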


For each type of bot, there is a corresponding mitigation tactic.


The key to mitigating aggressive, undesirable bots is to reduce their efficiency. In most cases, highly aggressive bots are only useful to their controllers if they can scrape a lot of content very quickly. By reducing a bot's efficiency through rate controls, tar pits or spider traps, you can drive bot-herders elsewhere for the data they need.
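A minimal tar pit can be built in Apache itself, assuming mod_ratelimit is available (Apache 2.4+): instead of blocking matched scrapers outright, throttle their bandwidth so every page costs them far more time. The user-agent list below is an assumption for illustration, not a canonical blocklist:

```apache
# Tar-pit sketch (assumes Apache 2.4 with mod_ratelimit and
# mod_setenvif): clients matching common scraping libraries still
# receive content, but at roughly 16 KiB/s, gutting their throughput.
<IfModule mod_ratelimit.c>
SetEnvIfNoCase User-Agent (HTTrack|libwww-perl|python-requests|wget) slow_bot
<If "reqenv('slow_bot') == '1'">
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 16
</If>
</IfModule>
```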


Aggressive but desirable bots are a slightly different problem. These bots adversely impact operations, but they bring a benefit to the organization. Therefore, it is impractical to block them fully. Rate controls with a high threshold, or a user-prioritization application (UPA) product, are a good way to minimize the impact of a bot. This permits the bot access to the site until the number of requests reaches a set threshold, at which point the bot is blocked or sent to a waiting room. In the meantime, legitimate users are able to access the site normally.
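Short of a commercial UPA product, a rough approximation of a high-threshold rate control is mod_evasive, configured at the server or virtual-host level (its directives do not work in .htaccess). The thresholds below are illustrative guesses, not tuned values:

```apache
# Sketch only: mod_evasive with deliberately high thresholds, so
# ordinary users never trip it, while a bot hammering the site gets
# a temporary 403 rather than a permanent block.
<IfModule mod_evasive20.c>
DOSHashTableSize    3097
DOSPageCount        50    # same URI this many times per interval -> block
DOSPageInterval     1     # seconds
DOSSiteCount        300   # total requests per interval, site-wide
DOSSiteInterval     1
DOSBlockingPeriod   60    # blocked clients are released after a minute
</IfModule>
```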
