Using robots.txt to block spiders from crawling your web site
'robots.txt' is a plain text file whose name gives it special meaning to most well-behaved robots on the web. By defining a few rules in this file, you can instruct robots not to crawl and index certain files or directories within your site.
For example, if you do not want Google to crawl your site's /pictures folder, a rule in robots.txt can tell Google's crawler to stay out of it.
A robots.txt file has to be placed in the web root directory of your server, so that robots can request it as /robots.txt. On Linux boxes, this directory is typically /var/www/html.
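Robots look for the file only at the top level of a host, so it can help to know the exact address they will request. The following is a minimal Python sketch of that lookup. The helper name `robots_url` and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL a robot would request for page_url.

    robots_url is a hypothetical helper name used only in this sketch.
    """
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port), replace the path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/pictures/2004/cat.jpg"))
# prints http://www.example.com/robots.txt
```

A robots.txt stored in a subfolder, such as /pictures/robots.txt, is therefore never consulted.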
The following listing shows three separate versions of a robots.txt file, separated by a line. Each version begins with a comment that describes its effect.
# block Google's image crawler completely
User-agent: Googlebot-Image
Disallow: /
----------------------------------------
# block all spiders and bots from these 2 directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /pictures/
----------------------------------------
# allow Googlebot to access everything except /cgi-bin/,
# allow ia_archiver (alexa.com) to access everything,
# and allow all other bots to access nothing
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: ia_archiver
Allow: /
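One way to check rules like these before uploading them is Python's standard `urllib.robotparser` module. The sketch below feeds it the third example and asks which paths each bot may fetch. The bot name `SomeOtherBot`, the helper name `allowed`, and the paths are made up for illustration, and real crawlers may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Third example from the listing: Googlebot may go everywhere except
# /cgi-bin/, ia_archiver may go everywhere, every other bot nowhere.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: ia_archiver
Allow: /
"""


def allowed(rules, user_agent, path):
    """Return True if user_agent may fetch path under the given rules text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical bot names and paths, chosen only to exercise each rule group.
for agent, path in [
    ("Googlebot", "/index.html"),
    ("Googlebot", "/cgi-bin/search.cgi"),
    ("ia_archiver", "/cgi-bin/search.cgi"),
    ("SomeOtherBot", "/index.html"),
]:
    print(agent, path, allowed(RULES, agent, path))
```

With these rules, Googlebot should be allowed on /index.html but not under /cgi-bin/, ia_archiver should be allowed everywhere, and any bot without its own group should fall back to the `*` group and be refused.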
Comments:
this is good stuff
Helpful examples - thanks