Controlling Search Engines with Robots.txt

Introduction

As mentioned in previous articles, search engines can be a great source of traffic to a standard business or personal website. What would happen, though, if you didn't want to appear in them? This is the purpose of robots.txt files. While they generally do not help you get listed, they can help ensure that you don't get listed if you wish not to be.

What is a robot?

A robot (also shortened to just "bot", or called a spider) is a computer program that goes around collecting information from websites. Different bots do different things, depending on their owners' reasons for running them. In the case of search engines, the robot's purpose is to collect information about what your site contains so that it can be included in the search engine's index.

So where does the robots.txt file fit in?

Search engines generally like to respect the owners of websites. Most provide the option of not including some or all of the pages on your site in their results, and the robots.txt file is how you tell them. Before a bot crawls the various pages you have, it will look inside your robots.txt file to see what it is allowed to access. If the bot doesn't find a robots.txt file, or the file is blank, it will normally assume that nothing is blocked and feel free to roam around your site.

So how do I control where it can go?

A robots.txt file can either restrict individual robots by name or cover them all with a single rule. Rules consist of two parts:

* User-agent: the name of the robot the rule applies to
* Disallow: the path that robot is banned from accessing

In the example below, we block the robot called googlebot from accessing greentree.html. Googlebot is the name of Google's search engine robot, and by blocking it from this page we would have the page removed from Google the next time its results are updated.

    User-agent: googlebot
    Disallow: /greentree.html

While this works well for an individual page, what if we wanted to block more? It would be highly inefficient to list every page on your site, but we could do:

    User-agent: googlebot
    Disallow: /greentree.html
    Disallow: /frogs/

The above rules block googlebot from accessing greentree.html and every page in the frogs directory. The whole site is still not blocked, but we have already significantly reduced what can be seen. To block the whole site, we disallow "/", which covers absolutely everything on the site. For example:

    User-agent: googlebot
    Disallow: /

You can block as many bots as you like by naming each one individually down the file. In the example below we have banned both googlebot and slurp (the name of Yahoo's robot) from the site.

    User-agent: googlebot
    Disallow: /

    User-agent: slurp
    Disallow: /

Finally, if the same rules apply to all bots, we can cover them with the "*" character instead:

    User-agent: *
    Disallow: /

It is worth mentioning that while almost every bot plays nicely with the websites it visits, some do not. If you have pages that really shouldn't be seen by any sort of robot, you should consider password protecting them instead.
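
To see how a compliant bot applies these rules in practice, here is a minimal sketch in Python using the standard library's urllib.robotparser module. The www.example.com address, the page paths, and the "examplebot" user-agent name are placeholders for illustration only, not real sites or crawlers.

    # Minimal sketch: how a well-behaved crawler checks robots.txt before
    # fetching a page. The site and user-agent below are placeholders.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("https://www.example.com/robots.txt")
    parser.read()  # download and parse the robots.txt file

    # Ask whether this user-agent may fetch particular pages.
    for page in ("https://www.example.com/greentree.html",
                 "https://www.example.com/frogs/tree-frog.html"):
        if parser.can_fetch("examplebot", page):
            print("Allowed to crawl", page)
        else:
            print("Blocked from crawling", page)

A polite crawler performs a check like this for every URL before requesting it; the rogue bots mentioned above simply skip the check, which is why password protection remains the only reliable barrier.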