  #3  
02-15-2007, 07:11 PM
Connie
Guest
 
#2 How to prevent crawling of your website using robots.txt

From Wikipedia:
The robots exclusion standard or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is, otherwise, publicly viewable.

This part is not a complete introduction to the use and benefits of the file robots.txt, which is a good tool for controlling bots and spiders (among other purposes). It is a short introduction that lists useful directives.

You can set different directives in that file, which must be placed in the root of your website (edit it with a plain-text (ASCII) editor and upload it to your webspace in ASCII mode).
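Because the file has to sit in the root, a bot always looks for it at the same place, no matter which page it starts from. Here is a minimal Python sketch of that lookup; the function name robots_url and the example.com address are only made up for illustration:

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    # The leading slash makes urljoin drop the page's own path and query
    # and keep only the scheme and host (hypothetical example site below).
    return urljoin(page_url, "/robots.txt")


print(robots_url("http://www.example.com/gallery/index.php?cat=3"))
# prints http://www.example.com/robots.txt
```

A robots.txt placed in a subfolder such as /gallery/ would simply never be requested by a cooperating bot.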

Since it would make no sense to block your website for all bots, indexing robots and search engines, it does make sense to block some of them explicitly.

To stop Microsoft Search (Windows Live) from crawling your site completely, you can add this:
Quote:
User-Agent: MSNBot
Disallow: /
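If you want to check such a rule before uploading it, Python's standard library ships a robots.txt parser. This is only a sketch: the example.com address and the name SomeOtherBot are invented for the test, and a real bot may read the file slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as in the quote above.
rules = """\
User-Agent: MSNBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# MSNBot is shut out of the whole site; a bot that is not named stays allowed.
msn_allowed = parser.can_fetch("MSNBot", "http://www.example.com/index.php")
other_allowed = parser.can_fetch("SomeOtherBot", "http://www.example.com/index.php")
print(msn_allowed, other_allowed)  # prints False True
```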
To stop Microsoft Search (Windows Live) from crawling your website like an amok-running idiot, you can add this to slow it down:
Quote:
User-Agent: MSNBot
Crawl-Delay: 36000
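To see what that number means in practice, here is a small Python sketch that reads the value back with the standard-library parser. It assumes the bot treats the value as seconds between requests, which is the usual reading of this non-standard directive, but each search engine decides that for itself.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as in the quote above.
rules = """\
User-Agent: MSNBot
Crawl-Delay: 36000
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns the number given for that bot, or None if there is none.
delay = parser.crawl_delay("MSNBot")
hours = delay / 3600  # assumption: the value is in seconds
print(delay, hours)  # prints 36000 10.0
```

So under that assumption the bot would wait ten hours between two requests, which slows it down to almost nothing.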
And another exotic directive, especially for Microsoft: to stop Microsoft Search from showing your website as a website preview, add this to your robots.txt:
Quote:
User-agent: searchpreview
Disallow: /
To stop Googlebot from indexing your site completely:
Quote:
User-agent: Googlebot
Disallow: /
To block all bots from indexing the images folder and the thumbnails folder, set these:
Quote:
User-agent: *
Disallow: /images
Disallow: /thumbnails
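The wildcard rules can be checked the same way. The sketch below uses Python's standard-library parser; the bot name AnyBot and the example.com paths are invented for the test. It also shows that a Disallow line matches by prefix, so /images covers more than just the folder.

```python
from urllib.robotparser import RobotFileParser

# The same three lines as in the quote above.
rules = """\
User-agent: *
Disallow: /images
Disallow: /thumbnails
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://www.example.com"  # hypothetical site
image_ok = parser.can_fetch("AnyBot", site + "/images/photo.jpg")
thumb_ok = parser.can_fetch("AnyBot", site + "/thumbnails/photo.jpg")
index_ok = parser.can_fetch("AnyBot", site + "/index.php")
# Prefix matching: a page whose path merely starts with /images is blocked too.
similar_ok = parser.can_fetch("AnyBot", site + "/images.html")

print(image_ok, thumb_ok, index_ok, similar_ok)  # prints False False True False
```

If you only want to block the folders themselves, writing the rules with a trailing slash (Disallow: /images/) should avoid catching pages like /images.html.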