Controlling Search Engines with Robots.txt
As mentioned in previous articles, search engines can be a great source of traffic for a typical business or personal website. What happens, though, if you don't want to appear in them? This is the purpose of the robots.txt file. While it generally won't help you get listed, it can help ensure that you don't get listed if you would rather not be. What is a robot?
A robot (often shortened to just "bot", or called a spider) is a computer program that travels around the web collecting information from websites. Different bots do different things, depending on their owners' reasons for running them. In the case of a search engine, the robot's purpose is to gather information about what your site contains so that it can be included in the search engine's results. So where does the robots.txt file fit in?
Search engines generally like to respect the wishes of website owners. Most give you the option of keeping some or all of the pages on your site out of their results, and the robots.txt file is how you tell them.
Before a bot goes around your site looking at the various pages you have, it will first take a look inside your robots.txt file to see what it is allowed to visit.
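The file itself sits in the root directory of your site. For a site at www.example.com (a placeholder domain used here purely for illustration), a bot would request it as:
www.example.com/robots.txt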
If the bot doesn't find a robots.txt file, or the file is blank, it will normally assume that nothing is blocked and feel free to roam around your site. So how do I control where it can go?
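Before adding any restrictions, it helps to see what an explicitly permissive file looks like. The minimal sketch below lets every robot (the "*" stands for all robots, as explained further down) visit everything, which has the same effect as having no robots.txt at all, because an empty Disallow line bans nothing:

User-agent: *
Disallow: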
A robots.txt file can either single out individual robots to restrict, or cover them all with a single rule. Each rule consists of two parts:

* User-agent: the name of the robot you want to control
* Disallow: the location that robot is banned from accessing
In the example below, we block the robot called googlebot from accessing the page /greentree.html (Disallow paths should start with a "/", which represents the root of your site). Googlebot is the name of Google's search engine robot, and by blocking it from this page we would remove the page from Google the next time it updates its results.

User-agent: googlebot
Disallow: /greentree.html
While this works well for an individual page, what if we wanted to block more than one? Listing every page on your site would be highly inefficient, but we can also block whole directories:

User-agent: googlebot
Disallow: /greentree.html
Disallow: /frogs/

The above code would block googlebot from accessing /greentree.html and every page in the /frogs/ directory.
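Disallow works as a simple "starts with" match, so the /frogs/ rule above covers any address beginning with /frogs/, for example /frogs/tree.html or /frogs/pond/index.html (made-up page names used just to illustrate).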
The whole site is still not blocked, but we have already significantly reduced what the robot can see. To block the entire site we disallow "/", the root directory, which covers absolutely everything on the site.
For example:
User-agent: googlebot
Disallow: /
You can block as many bots as you like by naming each one individually in the file. In the example below we have banned both googlebot and slurp (the name of Yahoo's robot) from the entire site.
User-agent: googlebot
Disallow: /

User-agent: slurp
Disallow: /
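Each robot follows only the rules listed under its own User-agent line, so different bots can also be given different restrictions. The directory names below are made-up examples:

User-agent: googlebot
Disallow: /private/

User-agent: slurp
Disallow: /images/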
If the same rules apply to all bots, we can cover them all at once with the "*" character instead.
User-agent: *
Disallow: /
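The two approaches can also be combined. A well-behaved robot obeys the record that names it specifically and ignores the "*" record, so the sketch below would block every robot except googlebot (the empty Disallow line leaves googlebot free to go anywhere):

User-agent: *
Disallow: /

User-agent: googlebot
Disallow: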
Finally, it is worth mentioning that while almost every bot plays nicely with the websites it visits, there are some that do not. Robots.txt is only a request, not an enforcement mechanism, so if you have pages that really shouldn't be seen by any sort of robot, you should consider password protecting them instead.
About the author:
David Fitzgerald is a network administrator for the cheap web hosting and domain name registration (www.cheap-web-site-hosting.com.au/cgi-bin/domains/domains.pl) services of Cheap Web Site Hosting.