Robots.txt or how to get your site properly spidered, crawled,
indexed by bots
So you heard someone stressing the importance of the robots.txt file, or noticed in your website's logs that the robots.txt file is causing an error, or saw it somehow sitting at the very top of your most-visited pages, or you read some article about the death of the robots.txt file and about how you should never bother with it again. Or maybe you have never heard of the robots.txt file but are intrigued by all that talk about spiders, robots and crawlers. In this article, I will hopefully make some sense out of all of the above.
There are many folks out there who vehemently insist that the robots.txt file is useless, proclaiming it obsolete, a thing of the past, plain dead. I disagree. The robots.txt file is probably not in the top ten methods to promote your get-rich-quick affiliate website in 24 hours or less, but it still plays a major role in the long run.
First of all, the robots.txt file is still a very important
factor in promoting and maintaining a site, and I will show you
why. Second, the robots.txt file is one of the simple means by
which you can protect your privacy and/or intellectual property.
I will show you how.
Let's try to figure out some of the lingo.
What is this robots.txt file?
The robots.txt file is just a very plain text file (or an ASCII
file, as some like to say), with a very simple set of
instructions that we give to a web robot, so the robot knows
which pages we need scanned (or crawled, or spidered, or indexed
- all terms refer to the same thing in this context) and which
pages we would like to keep out of search engines.
What is a www robot?
A robot is a computer program that automatically reads web pages
and goes through every link that it finds. The purpose of robots
is to gather information. Some of the most famous robots
mentioned in this article work for the search engines, indexing
all the information available on the web.
The first robot was developed at MIT and launched in 1993. It was named the World Wide Web Wanderer, and its initial purpose was purely scientific: its mission was to measure the growth of the web. The index generated from the experiment's
results proved to be an awesome tool and effectively became the
first search engine. Most of the stuff we consider today to be
indispensable online tools was born as a side effect of some
scientific experiment.
What is a search engine?
Generically, a search engine is a program that searches through a database. In the popular sense, as applied to the web, a search engine is a system with a user-facing search form that can search through a repository of web pages gathered by a robot.
What are spiders and crawlers?
Spiders and crawlers are robots, only the names sound cooler in
the press and within metro-geek circles.
What are the most popular robots? Is there a list?
Some of the most well-known robots are Google's Googlebot, MSN's MSNBot, Ask Jeeves's Teoma and Yahoo!'s Slurp (funny name). One of the most popular places to look up active robot info is the list maintained at http://www.robotstxt.org.
Why do I need this robots.txt file anyway?
A great reason to use a robots.txt file is actually the fact
that many search engines, including Google, post suggestions for
the public to make use of this tool. Why is it such a big deal
that Google teaches people about the robots.txt? Well, because
nowadays, search engines are not a playground for scientists and
geeks anymore, but large corporate enterprises. Google is one of
the most secretive search engines out there. Very little is
known to the public about how it operates, how it indexes, how
it searches, how it creates its rankings, etc. In fact, if you search carefully in specialized forums, or wherever else these issues are discussed, you'll find that nobody really agrees on whether Google puts more emphasis on this or that element to create its rankings. And when people don't agree on something as precise as a ranking algorithm, it means two things: that Google constantly changes its methods, and that it does not make them very clear or very public. There's only one thing that I believe to be crystal
clear. If they recommend that you use a robots.txt ("Make use of
the robots.txt file on your web server" - Google Technical
Guidelines), then do it. It might not help your ranking, but it
will definitely not hurt you.
There are other reasons to use the robots.txt file. If you use
your error logs to tweak and keep your site free of errors, you
will notice that most errors refer to someone or something not
finding the robots.txt file. All you have to do is create a blank text file (use Notepad in Windows, or the simplest text editor on Linux or on a Mac), name it robots.txt and upload it to the root of your server (that's where your home page is).
On a different note, nowadays, all search engines look for the
robots.txt file as soon as their robots arrive on your site.
There are unconfirmed rumors that some robots might even 'get annoyed' and leave if they don't find one. I'm not sure how true that is, but hey, why not be on the safe side?
Again, even if you don't intend to block anything or just don't
want to bother with this stuff at all, having a blank robots.txt
is still a good idea, as it can actually act as an invitation
into your site.
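If you prefer to script such chores, here is a minimal Python sketch that writes the two-line "allow everything" file described later in this article (the output filename and the upload step are up to you):

```python
# Sketch: generate an "allow everything" robots.txt locally.
# Afterwards, upload the file to your server's web root,
# right next to your home page.
content = "User-Agent: *\nDisallow:\n"

with open("robots.txt", "w") as f:
    f.write(content)

print(open("robots.txt").read())
```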
Don't I want my site indexed? Why stop robots?
Some robots are well designed, professionally operated, cause no
harm and provide valuable service to mankind (don't we all like
to "google"). Some robots are written by amateurs (remember, a
robot is just a program). Poorly written robots can cause
network overload, security problems, etc. The bottom line here
is that robots are devised and operated by humans and are prone
to the human error factor. Consequently, robots are not
inherently bad, nor inherently brilliant, and need careful
attention. This is another case where the robots.txt file comes
in handy - robot control.
Now, I'm sure your main goal in life, as a webmaster or site owner, is to get on the first page of Google. Then why in the world would you want to block robots?
Here are some scenarios:
1. Unfinished site
You are still building your site, or portions of it, and don't
want unfinished pages to appear in search engines. It is said
that some search engines even penalize sites with pages that
have been "under construction" for a long time.
2. Security
Always block your cgi-bin directory from robots. In most cases, cgi-bin contains applications and configuration files for those applications (which might actually hold sensitive information), etc. Even if you don't currently use any CGI scripts or programs, block it anyway; better safe than sorry.
3. Privacy
You might have some directories on your website where you keep
stuff that you don't want the entire Galaxy to see, such as
pictures of a friend who forgot to put clothes on, etc.
4. Doorway pages
Besides illicit attempts to increase rankings by blasting
doorways all over the internet, doorway pages actually do have a
very morally sound usage. They are similar pages, but each one
is optimized for a specific search engine. In this case, you
must make sure that individual robots do not have access to all
of them. This is extremely important, in order to avoid being
penalized for spamming a search engine with a series of
extremely similar pages.
5. Bad bot, bad bot, what'cha gonna do...
You might want to exclude robots whose known purpose is to
collect email addresses, or other robots whose activity does not
agree with your beliefs on the world.
6. Your site gets overwhelmed
In rare situations, a robot goes through your site too fast,
eating your bandwidth or slowing down your server. This is
called "rapid-fire" and you'll notice it if you are reading your
access log file. A medium performance server should not slow
down. You may however have problems if you have a low
performance site, such as one running of your personal PC or
Mac, if you run poor server software, or if you have heavy
scripts or huge documents. Is these cases, you'll see dropped
connections, heavy slowdowns, in extremes, even a complete
system crash. If this ever happens to you, read your logs, try
to get the robot's IP or name, read the list of active robots
and try to identify and block it.
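Spotting the culprit can be as simple as tallying user agents in your access log. A minimal sketch, assuming Apache's "combined" log format (the sample lines below are made up; in practice you would read your real access log):

```python
import re
from collections import Counter

# Made-up sample lines in Apache "combined" format; replace
# with lines read from your real access log file.
sample_lines = [
    '1.2.3.4 - - [10/Oct/2005:13:55:36 -0700] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1"',
    '1.2.3.4 - - [10/Oct/2005:13:55:37 -0700] "GET /index.htm HTTP/1.0" 200 2326 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [10/Oct/2005:13:55:38 -0700] "GET / HTTP/1.0" 200 2326 "-" "EmailHarvester/1.0"',
]

# In the combined format, the user agent is the last quoted field.
agent_re = re.compile(r'"([^"]*)"$')

counts = Counter(m.group(1) for line in sample_lines
                 if (m := agent_re.search(line)))

# The most frequent user agents are the prime rapid-fire suspects.
for agent, hits in counts.most_common():
    print(hits, agent)
```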
What's in a robots.txt file anyway?
There are only two lines for each entry in a robots.txt file: the User-Agent line, which names the robot you want to give orders to (or carries the '*' wildcard symbol, meaning 'all'), and the Disallow line, which tells a robot all the places it should not touch. The two-line entry can be repeated for every file or directory you don't want indexed, or for each robot you want to exclude. If you leave the Disallow line empty, you are not disallowing anything; in other words, you are allowing the particular robot to index your entire site. Some examples and a few scenarios should make it clear:
A. Exclude a file from Google's main robot (Googlebot):
User-Agent: Googlebot
Disallow: /private/privatefile.htm
B. Exclude a section of the site from all robots:
User-Agent: *
Disallow: /underconstruction/
Note that the directory name is enclosed between two forward slashes. Although you are probably used to seeing URLs, links and folder references that do not end with a slash, a web server always needs that trailing slash. Even when a link that does not end with a slash is clicked, the web server has to do an extra step before serving the page: it adds the slash through what we call a redirect. Always use the ending slash.
C. Allow everything (blank robots.txt):
User-Agent: *
Disallow:
Note that when a "blank robots.txt" is mentioned, it is not a
completely blank file, but it contains the two lines above.
D. Do not allow any robot on your site:
User-Agent: *
Disallow: /
Note that the single forward slash means "root", which is the
main entrance to your site.
E. Do not allow Google to index any of your images (Google uses
Googlebot-Image for images):
User-Agent: Googlebot-Image
Disallow: /
F. Do not allow Google to index some of your images:
User-Agent: Googlebot-Image
Disallow: /images_main/
Disallow: /images_girlfriend/
Disallow: /downloaded_pix/
Note the use of multiple disallows. This is allowed, no pun
intended.
G. Build a doorway for Google and Lycos (the Lycos robot is
called T-Rex) - do not play with this unless you are 100% sure
you know what you are doing:
User-Agent: T-Rex
Disallow: /index1.htm
User-Agent: Googlebot
Disallow: /index2.htm
H. Allow only Googlebot:
User-Agent: Googlebot
Disallow:
User-Agent: *
Disallow: /
Note that the entries are read in sequence. In English, the example above reads: let Googlebot through, then stop everyone else.
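You can double-check how rules like these are interpreted with Python's standard urllib.robotparser module. A small sketch (the second user agent name is made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# The "allow only Googlebot" rules from example H above.
rules = """\
User-Agent: Googlebot
Disallow:

User-Agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot is let through; any other robot is stopped.
print(parser.can_fetch("Googlebot", "/index.htm"))     # True
print(parser.can_fetch("SomeOtherBot", "/index.htm"))  # False
```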
If your file gets really large, or you just feel like writing notes for yourself or for potential viewers (remember, robots.txt is a public file; anyone can see it), you can do so by preceding your comment with a # sign. Although, according to the standard, you can put a comment on the same line as a command, I recommend that you start every command and every comment on a new line; this way, robots will never be confused by a potential formatting glitch. Examples:
This is correct as per the standard, but not recommended (a newer robot, or a badly written one, might read the following as "disallow the # We... directory" instead of complying with the "disallow all" command):
User-Agent: * Disallow: / # We decided to stop all robots but we
were very silly in typing a long comment which got truncated and
made the robots.txt unusable
The way I recommend that you format this is:
# We decided to stop all robots and we made sure
# that our comments do not get truncated
# in the process
User-Agent: *
Disallow: /
Although in theory each robot should comply with the standard introduced around 1994 and enhanced in 1996, each robot acts a little differently. You are advised to check the documentation provided by the owners of those robots; you'll be surprised to discover a world of useful facts and techniques.
For instance, from Google's site we learn that Googlebot
completely disregards any URL that contains "&id=".
Here are some sites to check:
Google: http://www.google.com/bot.html
Yahoo: http://help.yahoo.com/help/us/ysearch/slurp/
MSN: http://search.msn.com/docs/siteowner.aspx
A database of robots is maintained at http://www.robotstxt.org/wc/active/html/contact.html
A robots.txt validation tool - invaluable in finding potential typos that can completely change the way search engines see your site - can be found at http://searchengineworld.com/cgi-bin/robotcheck.cgi
There are also some extensions to the standard. For example,
some robots allow wildcards in the Disallow line, some even
allow different commands. My advice is: don't bother with
anything outside the standard and you will not be unpleasantly
surprised.
A final word of caution:
In this article I showed you how things should work in a perfect world. Somewhere along the way I mentioned that there are good bots and bad bots. Let's stop for a moment and think from a deranged person's perspective. Is there anything to prevent someone from writing a robot program that reads a robots.txt file and looks specifically at the pages you marked as "disallowed"? Absolutely not: this entire standard runs on the honor system, on the concept that everyone should work hard to make the internet a better place. Basically, do not rely on it for real security or privacy. Use passwords when necessary.
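For instance, on an Apache server, real protection for a private directory comes from HTTP authentication rather than robots.txt. A hedged config sketch (the path and realm name below are made-up examples):

```apache
# .htaccess placed inside the directory you want protected
AuthType Basic
AuthName "Private area"
# Password file created beforehand with the htpasswd tool;
# the path is an example - keep it outside your web root.
AuthUserFile /home/youruser/.htpasswd
Require valid-user
```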
In conclusion, do not forget that indexing robots are your best friends. While you should build your site for your human visitors rather than for robots, do not underestimate the power of those mindless crawlers. Make sure the pages you want indexed are clearly visible to robots, and make sure you have regular hyperlinks that robots can follow without roadblocks (robots can't follow Flash-based navigation systems, for instance). To keep your site at tip-top performance, your logs clean, and your applications, scripts and private data safe, always use a robots.txt file, and read your logs to monitor all robotic activity.