The robots.txt file
Since the beginning of the Internet there has been a need to index the Web,
and many robots have been built for this purpose. You already know
the famous Google bot, which crawls the Web to keep track
of URLs and build a map out of them (the link popularity
algorithm...).
There are not that many ways to scan a website, but some pages of a
website might not need to be crawled, for various reasons such as
privacy...
A Standard for Robot Exclusion has been created, and robots from
search engines and elsewhere now look for the robots.txt
file before starting to scan a website. This file tells the
robots which links may be scanned and which links
should not be indexed.
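To make this concrete, here is a minimal sketch, using Python's standard urllib.robotparser module, of how a polite crawler might check robots.txt before fetching a page; the site address and the 'MyBot' user-agent below are just placeholders:

from urllib import robotparser

# Fetch and parse the site's robots.txt (the URL is a placeholder)
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# can_fetch() tells the crawler whether this user-agent may visit the URL
if rp.can_fetch("MyBot", "http://www.example.com/some/page.html"):
    print("allowed to crawl this page")
else:
    print("disallowed by robots.txt")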
A good resource about the robots.txt file is at this address:
http://www.robotstxt.org
The site publishes information about Web robots; you may be
interested in it if you plan to create your own bot or want to
learn more about their history.
Practice
You may have noticed in your server logs requests for robots.txt
coming from robots; each one produces a 'file not found' error when
you don't have a robots.txt file. If you just want to clean these
errors out of the log files, you need to create the robots.txt file,
even if you leave it empty.
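If you prefer the file not to be completely empty, the conventional 'allow everything' robots.txt only needs two lines (an empty Disallow value means nothing is blocked):

User-agent: *
Disallow: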
The structure of this file is pretty simple: you can disallow
specific agents, you can disallow parts of your website or only a few
pages... or you can deny everything or allow everything.
Here is an example from http://www.robotstxt.org:
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /log
The webcrawler bot can go anywhere. The second paragraph
indicates that the robot called 'lycra' has all relative URLs
starting with '/' disallowed. Because all relative URLs on a
server start with '/', this means the entire site is closed off.
The third paragraph indicates that all other robots should not
visit URLs starting with /tmp or /log. Note the '*' is a special
token, meaning "any other User-agent"; you cannot use wildcard
patterns or regular expressions in either User-agent or Disallow
lines.
Validator
Once you are done with your robots.txt file, you should test it with
a robots.txt validator; there is one at this address:
Searchengineworld, robots check. The Searchengineworld
website also provides a more complete tutorial about the
robots.txt file.
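Before (or instead of) the online validator, you can also do a quick local sanity check with the same urllib.robotparser module; this rough sketch parses a local robots.txt and tests a few paths, and the 'MyBot' user-agent, file name and test paths are only examples:

from urllib import robotparser

# Read the rules from a local file instead of fetching them over HTTP
with open("robots.txt") as f:
    lines = f.read().splitlines()

rp = robotparser.RobotFileParser()
rp.parse(lines)

# Check a few paths against the rules for a hypothetical user-agent
for path in ("/tmp/index.html", "/log/access.log", "/index.html"):
    print(path, "->", "allowed" if rp.can_fetch("MyBot", path) else "disallowed")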
Notes
- The use of this file may reduce the bandwidth consumed by robots on
your server, if you have disallowed a few pages.
- It also cleans up your log files a little (one line less per bot
scan).
- The most important point is that this file is recommended for
handling duplicate websites. As you may get penalties when you have
duplicate sites, a solution is to deny robots access to one of them.
Specific
Each bot may act a little differently, so it's advised to check the
FAQ of each bot to learn more about its indexing behaviour. For
instance, for the Yahoo Slurp bot you can check this URL: Yahoo Slurp
Index.
The MSN bot: MSN Bot
The Google bot: Google Bot
Here is a database of web robots: Webrobots
Warning
There are hackers who search robots.txt files for directories and
files which should not be scanned; they are also called 'bad robots'.
The solution is either not to mention the links and directories you
want to hide, or to put them in a special place covered by additional
server-side protection.
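For example, rather than listing every sensitive file by name, you could disallow a single generic directory (the '/private/' name below is hypothetical) and rely on server-side authentication for the real protection:

User-agent: *
Disallow: /private/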
Future of web agents
The job of web agents is becoming more and more complex as the web
grows; technology is improving and connections get faster and
cheaper, but the cables are also getting busier and busier.
There are heaps of websites getting online every day, and the web
agents must still perform relevant indexing. Do you remember the time
when Google crawled a new website within a day? I guess not!
I would not be surprised if web agents started to implement a kind of
selection and automatically avoided websites that are not valid
HTML... My advice: follow the rules, test your site, and make it
conform to today's search engine guidelines...
The robots.txt file may help those agents understand your site, so
use it; it will reward you later.
Thanks for reading, I hope this article has been useful for some
of you.